



METHODS OF 
CORRELATION ANALYSIS 







BY 

MORDECAI EZEKIEL 

Chief, Agricultural-Industrial Relations Branch
Food and Agriculture Organization of the United Nations 
Fellow of the American Statistical Association
Fellow of the Econometric Society 


SECOND EDITION 

Ninth Printing 




NEW YORK 

JOHN WILEY & SONS, Inc. 

London: CHAPMAN & HALL, Limited 



COPYRIGHT, 1930, 1941

BY 

Mordecai Ezekiel

All Rights Reserved 

This book or any part thereof must not
be reproduced in any form without
the written permission of the publisher.

SECOND EDITION 

Ninth Printing, July, 1950


PRINTED IN U. S. A.



PREFACE TO SECOND EDITION 


Twice since the first edition of Methods of Correlation Analysis 
appeared there have been reprintings in which minor errors in computations
or typography were corrected. Now, a decade after the
publication of the first edition, I am making the first general revision. 

There have been many refinements and developments in the appli- 
cation of correlation methods to social and economic data during this 
period, and a beginning has been made in their application to engineer- 
ing and other technological problems. The general technique has been 
but little changed during the period, and the main body of methods 
still seems useful. The major changes during the decade have been, 
first, in the interpretation of the meaning of standard errors and, sec- 
ond, in the application of logical limitations to the flexibility of graphic 
curves. Other significant developments have been in the perfection of 
new and speedier methods of calculation and in the development of 
methods of estimating the reliability of an individual estimate or fore- 
cast. All these are covered in this revision. 

One completely new chapter has been added to this edition. That 
is Chapter 19, dealing with the reliability of an individual forecast and 
also with the applicability of error formulas to time series. The con- 
clusion is reached there that these formulas are more serviceable in 
connection with time series than has generally been believed. Chap- 
ter 16, dealing with the short-cut (Bean) method of graphic correlation, 
has been almost entirely rewritten and materially enlarged. Increased 
emphasis is placed upon the precautions which need to be taken to get
dependable results by this method and upon the way in which logical
analysis should be used to place limitations upon the shape of the 
curves fitted, and thus prevent undue flexibility in their fitting. The 
chapters dealing with sampling theory, Chapter 2 for means and Chapter
18 for correlation results, have been materially revised to bring the
explanation of the significance of standard error computations up to 
the modern interpretation. The section on the sampling significance of 
graphic regression curves has been moved from the technical appendix
to this section and has also been materially expanded, with fuller illustrations.
After a decade of use, it is now believed that this technique
provides a valuable check on the significance of graphic regression and 
net regression curves. 

Other chapters have been less extensively revised. Chapter 23, on 
examples of correlation applications, has been briefly brought up to 
date. One time-series analysis has been extrapolated to date in Chap- 
ter 14. A new explanatory example, which it is believed will aid the 
student in comprehending the meaning of partial regression coefficients, 
has been added at the beginning of Chapter 10; and Chapter 11 has 
been expanded somewhat. Although the analysis of variance is intro- 
duced here, no attempt is made to provide a complete treatment for it, 
as it was felt to lie outside the major field of this book. Chapters 7, 
13, and 15, dealing with the measurement of standard error of estimate 
and degree of correlation, have also been revised to state more pre- 
cisely the meaning of the adjustment of the crude coefficients to obtain 
unbiased estimates of the probable value in the universe. Other chap- 
ters have been corrected or expanded in various details. The appen- 
dix on methods of computation has been expanded to cover the most 
expeditious methods of computing partial correlation coefficients, the 
standard error of an individual forecast, and of making graphic trans- 
fers in the graphic short-cut method; and the explanations on the charts 
in Appendix 3 have been modified in line with the changes in Chap- 
ters 2 and 18. 

With respect to the perennial debate as between the use of elaborate 
mathematical curves or transformations or the use of freehand curves 
in representing curvilinear regressions, my basic position remains un- 
changed in favoring freehand curves unless there are logical reasons 
for the selection of a particular mathematical equation. Much more 
attention is given to the logical meaning of freehand curves, however, 
and to the use of logical limitations in drawing in the curves. As be- 
fore, the techniques for both methods are described and illustrated. 
The cross-referencing from one method to the other, and the discussion 
of the proper place for each, has also been somewhat expanded. 

To aid instructors and others who may wish to use this revised 
edition along with the old, the table numbers have been left unchanged 
throughout the body of the book, new tables being designated by an A 
or B after the number. Figure numbers similarly are left unchanged 
up to Chapter 16, where the considerable number of new figures added 
made it seem better to begin renumbering. Equation numbers have 
been left unchanged throughout most of the body of the book, equations 
being renumbered only from Chapter 21 on. Prior to that point, equations
numbered with whole numbers stand exactly as in the first edition;
when the previous equations were changed or new equations were
added, they are numbered with decimal fractions. 

I hope that with these changes and additions the book will prove 
more useful than heretofore for classroom purposes and individual 
study. Naturally I am grateful that so specialized a book as this has 
found so wide an application in teaching and research, and I am always 
interested in hearing of applications of these methods to new fields. 

During recent years I have had to devote myself primarily to 
matters of economic policy and have not been able to follow the de- 
velopments in statistical methods as closely as during the period when 
this book was first taking shape. In preparing this revision I have had 
to lean heavily on the advice of those who in recent years have been 
closer to statistical teaching and practice than I have been myself. 
Valuable suggestions as to desirable revisions and new content have 
been received from Frederick V. Waugh, Charles F. Sarle, Elmer J. 
Working, Louis H. Bean, O. C. Stine, and Clarence M. Purves. I am
indebted to my first teacher, Howard R. Tolley, for many suggestions 
noted during the period he was using the book for classroom teaching 
at the University of California. In addition, much of the revision, 
especially in the more mathematical sections, has been guided by the 
advice of two expert mathematical statisticians, W. Edwards Deming 
and Meyer A. Girshick. I am deeply indebted to them both for helpful
suggestions and criticisms and for reading much of the revised manu- 
script, especially the sections dealing with the sampling significance of 
results. The increased precision and clarity of these sections are 
largely attributable to their aid. R. G. Hainsworth has again helped 
me with the figures, maintaining consistency with the excellence of 
those he prepared for the first edition. Any errors or misstatements
remain my own responsibility, and not that of those who have aided 
with suggestions or criticisms. 

To these and to many others who, over the years, have called my 
attention to errors or suggested revisions I express my appreciation 
and gratitude. 

Although the new material has been carefully checked, some errors 
of computation or notation have no doubt crept in. Again I shall be 
grateful if any student or reader will inform me of any such errors he 
notices. 

Mordecai Ezekiel 

Washington, D. C. 

June 15, 1941




PREFACE TO FIRST EDITION

This book is not intended to cover the entire field of statistics, but 
rather, as its name indicates, that part of the field which is con- 
cerned with studying the relations between variables. The first two 
chapters are devoted to a brief review of the central elements in the 
measurement of variability in a statistical series, and to the essential 
concepts in judging the reliability of conclusions. These chapters 
are not to be regarded as a full statement, but instead as brief sum- 
maries to clarify the basic ideas which are involved in the subsequent 
development. 

No attempt is made in the body of the text to present the mathe- 
matical theory on which the art of statistical analysis is based. In- 
stead, the aim throughout has been to show how the various methods 
may be employed in practical research work, what their limitations
are, and what the results really mean. Only the simplest of algebraic
statements have been employed, and the practical procedure for each
operation has been worked out step by step. It is believed that the 
material will be readily comprehensible to anyone who has had courses 
in elementary algebra. 

Although the examples which are used in presenting the several
methods are drawn very largely from the author’s own field of
agricultural economics, the methods themselves are explained in
sufficiently general terms so that they can be applied in any field. In
addition, two chapters are devoted to a discussion of the types of
problems in a great many different fields of work to which correlation
analysis has been successfully applied, and to research methods and
the place of correlation analysis in research. It is hoped that this
presentation will assist research workers in many fields to appreciate
both the possibilities and the limitations of correlation analysis, and
so gain from their data knowledge of all the relations which so
frequently lie hidden beneath the surface.

Where the methods presented are the well-established ones developed
by the fathers of the modern science, mainly the English statisticians,
no attempt is made to prove or derive the various formulas.
On a few crucial points, however, or where derivations not generally
accessible are involved, the derivations of the formulas are shown in 
notes in the technical appendix, in the simplest manner possible. 

The methods presented in this book, insofar as they constitute 
an advance over those previously available, represent largely the 
joint product of a group of young researchers in the Bureau of 
Agricultural Economics of the United States Department of Agricul- 
ture during the past decade. The new methods include (a) the appli- 
cation of the Doolittle method to the solution of multiple correlation 
problems, greatly reducing the labor of obtaining multiple correlation 
results, and making feasible the use of multiple correlation in actual 
research work; (b) the development of approximate methods for 
determining curvilinear multiple correlations, and, more recently, 
very rapid graphic methods for their determination; (c) the recognition
of “joint” correlation, and the gradual development of methods
of treating it; and (d) by extensive use in actual investigations,
concrete demonstration of the possibilities of these methods in research 
work. These recent developments in correlation analysis are as yet 
largely unavailable except in the original articles in technical jour- 
nals. One object of this book is to present them in organized form, 
and with such interpretation that their significance and application 
may be fully understood. 

During the last two decades, the English statisticians “Student” 
and R. A. Fisher have been developing more exact methods of judg- 
ing the reliability of conclusions, particularly where those conclusions 
involve correlation or are based on small samples. These new meth- 
ods have as yet received but little recognition from American statisti- 
cians. They are presented here as simply as possible, and the dis- 
cussion of the reliability of conclusions gives them full consideration. 

So many persons have helped in the years during which this book 
has been growing that it is difficult for me to enumerate them all.
First of all I should like to mention Howard R. Tolley, from whom 
I received my introduction to statistics, and with whom it has been a 
constant joy to work. I give him credit for much that is included 
here. The very order of presentation reflects that which he worked
out for his classes. In a very real sense this book is a product of the 
spirit of research with which the Bureau of Agricultural Economics 
was imbued by the broad vision of Henry C. Taylor. John D. Black 
was the first to point out some of the undeveloped phases of statistical 
analysis, and then aided with encouragement and counsel in their
solution. Bradford B. Smith aided in the beginning of the new developments,
and his vivid imagination and logical mind have been a
constant help. Among others who have collaborated in various stages, 
or who have independently worked out various phases of the problem, 
may be mentioned Sewall Wright, Donald Bruce, Fred Waugh, Louis 
Bean, and Andrew Court. Susie White, Helen L. Lee, and Della E. 
Merrick have given intelligent, conscientious, and loyal assistance in 
the clerical work in the development and testing of each new step. 

In the preparation of the book itself I have had generous and 
willing help. Dorothea Kittredge and Bruce Mudgett have given the 
very substantial assistance of a detailed reading of the entire text, 
and many improvements in presentation and in material are due to 
their suggestions. For two terms the mimeographed manuscript has 
been used as a text in the United States Department of Agriculture 
Graduate School, and the members of the class have helped me in 
working out the illustrations, in clarifying the text, and in eliminat- 
ing errors. R. G. Hainsworth, who prepared the figures, deserves 
credit for the excellence of the graphic illustrations. O. V. Wells
helped in computing many of the illustrative problems, and Cor- 
rine F. Kyle in verifying the arithmetic. For the laborious and 
exacting work of typing the preliminary stencils, the many re- 
visions, and the final manuscript, and for her care, patience, and 
suggestions, I am indebted to my mother, Rachel Brill Ezekiel; and 
for editing the manuscript and helping in the lengthy task of proof- 
reading, to my wife, Lucille Finsterwald Ezekiel. 

To all these, and to the many others who have helped me in the 
development of this work, I take this opportunity of expressing 
my obligation and my gratitude. 

For any errors in the statements made and in the theories ad- 
vanced, I alone am of course responsible. Although the text has been 
checked painstakingly, it is hardly to be hoped that a publication of 
this character will appear without some errors creeping in, in mathe- 
matics, in arithmetic, or in spelling. When such errors, or any 
ambiguities of statement, are noted by any reader, I would be very 
grateful if he would inform me of them. 

Mordecai Ezekiel. 

Washington, D. C.,

April 20, 1930.




CONTENTS 


CHAPTER 1
MEASURING THE VARIABILITY OF A STATISTICAL SERIES

The Arithmetic Average
Frequency Tables
Average Deviation
Standard Deviation

CHAPTER 2
JUDGING THE RELIABILITY OF STATISTICAL RESULTS

Assumptions in Sampling
Computing the Standard Error
Reliability of Small Samples
Meaning and Use of the Standard Error
Universes, Past and Present

CHAPTER 3
THE RELATION BETWEEN TWO VARIABLES, AND THE IDEA OF FUNCTION

Relations between Variables
Graphic Representation of Relation between Two Variables
Expressing a Functional Relation Mathematically
Determining a Functional Relation Statistically

CHAPTER 4
DETERMINING THE WAY ONE VARIABLE CHANGES WHEN ANOTHER CHANGES:
(1) BY THE USE OF AVERAGES

Independent and Dependent Variables
Reliability of Group Averages
Range within Which True Relation May Fall


CHAPTER 5
DETERMINING THE WAY ONE VARIABLE CHANGES WITH ANOTHER:
(2) ACCORDING TO THE STRAIGHT-LINE FUNCTION

The Equation of a Straight Line
Fitting the Equation by Least Squares
Interpreting the Linear Equation

CHAPTER 6
DETERMINING THE WAY ONE VARIABLE CHANGES WHEN ANOTHER CHANGES:
(3) FOR CURVILINEAR FUNCTIONS

Different Types of Equations
Fitting a Simple Parabola
Fitting a Cubic Parabola
Fitting a Logarithmic Curve
Expressing a Curvilinear Relation by a Free-hand Curve
The Logical Significance of Mathematical Functions
A Mathematical Equation Used in an Economic Problem
Limitations in Estimating One Variable from Known Values of Another

CHAPTER 7
MEASURING ACCURACY OF ESTIMATE AND DEGREE OF CORRELATION

The Closeness of Estimate: Standard Error of Estimate
  For Linear Relations
  For Curvilinear Relations
  Adjustment of Standard Error of Estimate for Number of Observations
The Relative Importance of the Relation: Correlation
  Linear: Coefficient of Correlation
  Curvilinear: Index of Correlation
  Adjustments for Number of Observations

CHAPTER 8
PRACTICAL METHODS FOR WORKING TWO-VARIABLE CORRELATION PROBLEMS

Terms to be Used
Working Out a Linear Correlation
Interpreting the Results
Working Out a Curvilinear Correlation
Interpreting the Results


CHAPTER 9
THREE MEASURES OF CORRELATION: THE MEANING AND USE FOR EACH

CHAPTER 10
DETERMINING THE WAY ONE VARIABLE CHANGES WHEN TWO
OR MORE VARIABLES CHANGE: (1) BY SUCCESSIVE ELIMINATION

Theoretical Example
Practical Example
Eliminating the Approximate Influence of One Variable
Eliminating the Approximate Influence of Both Variables
Correcting Results by Successive Eliminations

CHAPTER 11
DETERMINING THE WAY ONE VARIABLE CHANGES WHEN TWO
OR MORE OTHER VARIABLES CHANGE: (2) BY CROSS-CLASSIFICATION
AND AVERAGES

Cross-classification for Three Variables
Differences between Matched Sub-groups
Limitations of Cross-classification for Many Variables

CHAPTER 12
DETERMINING THE WAY ONE VARIABLE CHANGES WHEN TWO
OR MORE OTHER VARIABLES CHANGE: (3) BY USING A
LINEAR REGRESSION EQUATION

Determining a Regression Equation for Two Independent Variables
Determining a Regression Equation for Three Independent Variables
Determining the Regression Equation for any Number of Independent Variables
Interpreting the Multiple Regression Equation

CHAPTER 13
MEASURING ACCURACY OF ESTIMATE AND DEGREE OF CORRELATION
FOR LINEAR MULTIPLE CORRELATION

Standard Error of Estimate
Multiple Correlation
Measuring the Separate Effect of Individual Variables
Partial Correlation
“Beta” Coefficients


CHAPTER 14
DETERMINING THE WAY ONE VARIABLE CHANGES WHEN TWO
OR MORE OTHER VARIABLES CHANGE: (4) USING CURVILINEAR
REGRESSIONS

Multiple Regression Curves Mathematically Determined
Multiple Regression Curves by Successive Approximations
Determining the First Approximation Net Regression Curves
Estimating X1 from the First Approximation Curves
Determining the Second Approximation Regression Curves
Estimating X1 from the Second Approximation Curves
Correcting the Curves by Further Successive Approximations
Stating the Final Conclusions
Limitations on the Use of the Results
A Test in Actual Forecasting of Yield
Reliability of Regression Curves

CHAPTER 15
MEASURING ACCURACY OF ESTIMATE AND DEGREE OF CORRELATION
FOR CURVILINEAR MULTIPLE CORRELATION

Standard Error of Estimate
Index of Multiple Correlation
Measuring the Net Curvilinear Importance of Individual Factors

CHAPTER 16
SHORT-CUT METHODS OF DETERMINING NET REGRESSION
LINES AND CURVES

Linear Net Regressions
The Short-cut Method Applied to Curvilinear Regressions
Identifying Joint Relations by the Short-cut Process
Application of the Short-cut Method to Large Samples

CHAPTER 17
MEASURING THE WAY A DEPENDENT VARIABLE CHANGES
WITH CHANGES IN A NON-QUANTITATIVE INDEPENDENT
FACTOR

Eliminating the Influence of Other Variables
Determining the Net Influence of the New Variable
Making Further Successive Approximations


CHAPTER 18
DETERMINING THE RELIABILITY OF CORRELATION CONCLUSIONS

Simple Correlation
  Regression Coefficients
  Correlation Coefficients
  Correlation Indexes
Multiple Correlation
  Coefficients of Multiple Correlation and Net Regression
  Multiple Curvilinear Correlation

CHAPTER 19
THE RELIABILITY OF AN INDIVIDUAL FORECAST AND OF TIME-SERIES
ANALYSES

Reliability of an Individual Forecast
  Simple Correlation
  Multiple Correlation
  Extrapolation of a Regression Equation beyond the Observed Range
Error Formulas for Time Series
Practical Procedures for Judging Reliability of Forecasts

CHAPTER 20
INFLUENCE OF SELECTION OF SAMPLE AND ACCURACY OF
OBSERVATIONS ON CORRELATION RESULTS

Selection of Sample
  With Respect to Values of the Independent Variable
  With Respect to Values of the Dependent Factor
  With Reference to Values of Both Variables
Accuracy of Observations
  Errors in the Dependent Variable
  Errors in the Independent Variable
  Errors in Both Variables
  Errors of Observation in Multiple Correlations

CHAPTER 21
MEASURING THE RELATION BETWEEN ONE VARIABLE AND
TWO OR MORE OTHERS OPERATING JOINTLY

Determining a Joint Function for Two Independent Variables
Determining a Joint Function for Two Independent Variables, Holding Other Independent Variables Constant
Measuring Correlation with Respect to Joint Functions
Determining Joint Influence of Three or More Independent Variables


CHAPTER 22
SUPPLEMENTARY METHODS FOR DETERMINING CURVILINEAR
AND JOINT RELATIONS

Determining Net Regression Curves by Mathematical Functions
Supplementary Methods of Determining the Final Shape of Net Regression Curves
Determining Joint Relations by Contours
Determining Joint Relations by Definite Mathematical Function
Measures of Correlation for Mathematically Determined Regressions

CHAPTER 23
TYPES OF PROBLEMS TO WHICH CORRELATION ANALYSIS HAS
BEEN APPLIED

Land Values
Physical Relations between Input and Output
Weather Conditions and Crop Yields
Relation of Physical Characteristics to Chemical Characteristics
Relation of Farm Organization to Farm Income
Relation of Economic Conditions to Market Price for a Commodity
Relation of Characteristics of Different Lots of a Commodity to Prices at which They Sell
Other Price Studies
Relation of Changes in Production to Prices and Other Factors
Miscellaneous Agricultural Problems
Correlation in Psychology and Education
Correlation Analysis in Other Fields
More Recent Applications of Correlation Analysis

CHAPTER 24
STEPS IN RESEARCH WORK, AND THE PLACE OF STATISTICAL
ANALYSIS

Relation of Statistical Analysis to Research
Stating the Objective
Developing an Hypothesis
Measuring the Factors
Studying the Apparent Relations
Running a Correlation Analysis
Meaning of Correlation Results


APPENDIX 1
METHODS OF COMPUTATION

Coefficients of Correlation and Regression
Coefficients of Multiple Correlation and Net Regression
Use of the Check Sum
The Doolittle Method for Solving Normal Equations
Standard Errors of Partial Regression Coefficients and of an Individual Estimate
Coefficients of Partial Correlation
Graphic Processes with the Short-cut Method

APPENDIX 2
TECHNICAL NOTES

APPENDIX 3
GRAPHIC CHARTS FOR INTERPRETING OR ADJUSTING CORRELATION
CONSTANTS

Reliability of Small Samples
Reliability of Observed Correlations
Adjustment of Correlation for Size of Sample

APPENDIX 4
LIST OF IMPORTANT EQUATIONS

APPENDIX 5
GLOSSARY

REFERENCES

INDEX




CHAPTER 1


MEASURING THE VARIABILITY OF A STATISTICAL SERIES 

Statistical analysis is used where the thing to be studied can be
reduced to or stated in terms of numbers. Not all the undertakings 
that rely on measurements ordinarily employ statistical analyses. In 
surveying, physics, and chemistry, for example, the particular thing 
being studied can usually be measured so closely, and varies over 
such a small range, that the true value can be established within 
narrow limits. In fact, the concept of true value owes its existence 
to the reproducibility of measurements in certain fields. In many 
natural sciences, likewise, the problem to be studied can be simplified 
by the use of controlled experimental conditions, which permit the 
influence of various factors to be studied one at a time. Even in such 
sciences, statistical methods can be used to plan experiments in such 
a way as to make the conclusions most significant with a minimum 
of effort. In the social sciences, there are fewer opportunities for the 
use of controlled experiments. Such sciences have to rely on statistical 
analysis, both to judge the significance of observed differences and to 
untangle the separate effects of multiple factors. Statistical analysis 
is used in the study of occurrences where the true value or relation 
cannot be measured directly or is hidden by other things. The 
numerical statement of the occurrence or of the relationship cannot be 
obtained directly from the original or “raw” figures. Instead, the 
data must be analyzed to determine the values desired. 

The especial need for analytical methods in the social sciences 
has been clearly stated by an eminent Englishman, as follows:¹

Causation in social science is never simple and single as in
physics or biology, but always multiple and complex. It is of
course true that one-to-one causation is an artificial affair, only
to be unearthed by isolating phenomena from their total background.
Nonetheless, this method is the most powerful weapon
in the armory of natural science: it disentangles the chaotic field
of influence and reduces it to a series of single causes, each of
which can then be given due weight when the isolates are put
back into their natural interrelatedness, or when they are deliberately
combined (as in modern electrical science and its applications)
into new complexes unknown in nature. This method of
analysis is impossible in social science. Multiple causation here is
irreducible.

The problem is a two-fold one. In the first place, the human
mind is always looking for single causes for phenomena. The very
idea of multiple causation is not only difficult, but definitely antipathetic.
And secondly, even when the social scientist has overcome
this resistance, extreme practical difficulties remain. Somehow
he must disentangle the single causes from the multiple field
of which they form an inseparable part. And for this a new technique
is necessary.

¹ Julian Huxley, The science of society, Virginia Quarterly Review, Vol. 16,
No. 3, pp. 348-65, summer, 1940.

The arithmetic average. The basic forms of statistical analysis 
have to do with organizing quantitative information as a basis for 
drawing inferences. Some of the basic work involves averaging and 
classifying data. Thus if one were studying the yield of corn in one 
year in some area, say a county, for example, he might talk with 20 
farmers picked at random and obtain figures, such as those in Table 1, 
showing the yield of corn which each farmer had obtained.

The most natural first step in reducing such a series of observations
to more usable shape is to find the arithmetic average, that is, to add
all the yields reported and divide by the number of items. The 20
reports total 600 bushels, or an average of 30 bushels.² This provides
a single figure into which is condensed one characteristic of the whole
group.
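
The arithmetic just described can be checked with a short sketch. The Python below is offered only as an illustration; the names `yields`, `total`, and `mean_yield` are assumptions introduced here and do not come from the text. The figures are the 20 yields listed in Table 1.

```python
# Yields in bushels per acre for farmers 1 through 20, copied from Table 1.
yields = [29, 25, 38, 30, 27,
          33, 26, 28, 30, 29,
          29, 35, 26, 23, 31,
          33, 31, 37, 28, 32]

n = len(yields)          # number of reports
total = sum(yields)      # sum of all the yields reported
mean_yield = total / n   # arithmetic average: the sum divided by the number of items

print(n, total, mean_yield)
```

With the Table 1 figures this gives 20 reports, a total of 600 bushels, and an average of 30 bushels, in agreement with the text.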

² Bushels are used here to represent any other quantity in which one might be
interested in a particular case. If we let X' represent the number of bushels reported
by farmer 1, X'' the bushels reported by farmer 2, X''' the bushels by farmer 3, and
so on, we can then represent the sum of all the reports by the expression $\sum X$ (read
“summation of the X’s”). Similarly, if we use n to represent the number of observations
we have obtained and use $M_x$ to represent the average (or mean) number of
bushels for all reports we can define the arithmetic mean by the formula:

$$M_x = \frac{\sum X}{n} \qquad (1)$$

This formula can be applied to anything we are studying, no matter whether X
means bushels of corn, inches in height, degrees of temperature, or any other measurable
quantity; or whether there are 2 cases or 2 million. This is a perfectly general
formula which can be applied to any given problem. As statistics is a study of
general methods, so stated that they can be applied to particular problems as desired,
it will be necessary to use many general formulas of this sort. The student should
therefore familiarize himself with the definitions given above and with the way they
are used in formula (1), so that he will be able to understand and use each formula as
it occurs.





TABLE 1

Yields of Corn Obtained by Twenty Farmers*
(Yield in bushels per acre)

Farmer  Yield    Farmer  Yield    Farmer  Yield    Farmer  Yield
   1      29        6      33       11      29       16      33
   2      25        7      26       12      35       17      31
   3      38        8      28       13      26       18      37
   4      30        9      30       14      23       19      28
   5      27       10      29       15      31       20      32

* In making entries in a table such as this, the actual values may be “rounded off” to any desired
extent. In this case they are rounded to the nearest whole bushel. For example, “33 bushels”
represents any report of 32.5 bushels or more, and any up to but not including 33.5 bushels. If the
original reports were secured to the nearest tenth bushel, this might be indicated by writing “32.5-33.4”
instead of “33”; or if secured to the nearest hundredth bushel, by writing “32.50-33.49.”
The entry “32.5 to 33.5” will be used to indicate “from 32.5 up to but not including 33.5,” whereas
“32.5-33.4” will be used to mean “from 32.5 to 33.4, both inclusive.”

But the average is not the only characteristic of the group which
might be of interest. The average would still be 30 if every one of
the 20 farmers had had a yield of 30 bushels per acre; yet there
certainly would be a significant difference between 20 reports each
of 30 bushels, and 20 reports ranging from 23 to 38 bushels, even
though both did have the same average.

Classifying the data. One way of showing the differences in the 
individual reports is to arrange them in some regular order. If the
farmers interviewed have simply been visited at random, and not
selected so that those visited first represent one portion of the county 
and those visited later another portion, the order in which the records 
stand has nothing to do with their meaning. As a first step to seeing 
just what the data do show they can be rearranged in order from 
smallest to largest, as shown in Table 2. 

TABLE 2

Yields of Corn on 20 Farms, Arranged in Order of Increasing Yields

Bushels per acre

   23      28      30      33
   25      28      30      33
   26      29      31      35
   26      29      31      37
   27      29      32      38





It is now easier to tell from the series something about the group 
of reports. One can now see that only 1 farmer had yields of less than 
25 bushels per acre, and only 2 had more than 35, so that 17 out of 
the 20 had 25 to 35, inclusive. The series shows, too, that 10 of the 
farmers had less than 30 bushels of corn per acre and 10 had 30 or 
more, so that the figures 29 and 30 mark the middle of the number 
of yields reported. If we divide each half into halves again, we see 
that 5 men had yields of 27 bushels or less, 5 had yields of 33 bushels 
or more, whereas 10 men — half of those reporting — had yields of 28 to 
32 bushels, inclusive. This tells something about how variable yields 
were from farm to farm in the area from which the reports were 
secured: half the reports fell within this 5-bushel range.²
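The halving and quartering described above can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are our own, and the quartiles are found by the simple rule used in the text (the middle of each half of the ordered reports), which is one of several conventions in use.

```python
from statistics import median

# Yields from Table 1, bushels per acre (farmers 1 to 20).
yields = [29, 25, 38, 30, 27, 33, 26, 28, 30, 29,
          29, 35, 26, 23, 31, 33, 31, 37, 28, 32]

ordered = sorted(yields)                 # the arrangement of Table 2
half = len(ordered) // 2
lower_half, upper_half = ordered[:half], ordered[half:]

middle = median(ordered)                 # divides the reports into halves
lower_quarter = median(lower_half)       # divides the lower half again
upper_quarter = median(upper_half)       # divides the upper half again

print(middle, lower_quarter, upper_quarter)   # 29.5 27.5 32.5
print(upper_quarter - lower_quarter)          # 5.0, the range holding the central half
```

Ten of the 20 reports (28 to 32 bushels, inclusive) lie between the two dividing figures, which agrees with the count made in the text.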

Even as rearranged in Table 2, the 20 reports still constitute a large 
tabulation. If there were several hundred, such a listing would be so 
unwieldy that it would be difficult to use. 

Frequency tables. The records can be studied more easily if, instead
of writing "29" three times when there are 3 farmers with 29
bushels each, we simply show that each of 3 men reported 29 bushels.
Similarly, instead of putting "30" down twice, we can show that 30
bushels were reported by 2 men. If this operation is performed for all
the reports, the data can then be assembled into what is known as a
"frequency table." It shows the frequency, that is, the number of
times each yield of corn was reported.

In preparing a frequency table such as Table 3, spaces are put
in for all yields (such as 24 bushels) for which no reports were
received, but which lie between the largest and the smallest report, to
show clearly that no such yields were reported.
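As a rough illustration of the counting just described, the following Python sketch tallies the 20 reports and fills in a zero for every whole-bushel yield that was not reported. The names used are our own, chosen for the example.

```python
from collections import Counter

# Yields from Table 1, bushels per acre (farmers 1 to 20).
yields = [29, 25, 38, 30, 27, 33, 26, 28, 30, 29,
          29, 35, 26, 23, 31, 33, 31, 37, 28, 32]

counts = Counter(yields)   # number of times each yield was reported

# List every whole-bushel yield from the smallest to the largest report,
# so that unreported yields (such as 24 bushels) show a frequency of 0.
frequency_table = {y: counts.get(y, 0)
                   for y in range(min(yields), max(yields) + 1)}

for bushels, times_reported in frequency_table.items():
    print(bushels, times_reported)
```

The frequencies necessarily add up to the number of reports, which gives a quick check on the tally.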

Table 3 is an improvement on Table 2, but it is still pretty long,
and if the lowest yield had happened to be 15, say, and the highest
60, it would have been longer still. For that reason it is frequently
desirable to group the reports, not only for a yield of a specified
number of bushels but for yields within a certain range of bushels.
Thus Table 4 is just the same as Table 3, except that, instead of
showing the number of reports by individual bushel groups, it shows the
number of reports for groups covering 3 bushels.

TABLE 3

Frequency Table, Showing Number of Times Each Yield Was Reported, by
Individual Bushels

Yield of corn    Number of times    Yield of corn    Number of times
   Bushels          reported           Bushels          reported
     23                1                 31                2
     24                0                 32                1
     25                1                 33                2
     26                2                 34                0
     27                1                 35                1
     28                2                 36                0
     29                3                 37                1
     30                2                 38                1

TABLE 4

Frequency Table, Showing Number of Times Each Yield Was Reported,
by 3-Bushel Groups

Yield of corn    Number of times
   Bushels          reported
  22.5-25.4            2
  25.5-28.4            5
  28.5-31.4            7
  31.5-34.4            3
  34.5-37.4            2
  37.5-40.4            1

The presentation is now condensed enough so that it can be readily
understood. It is easy to see that most of the reports fell around
25.5 to 34.4 bushels and that more fell near 30 bushels than anywhere
else. Of course, the 3-bushel group is purely arbitrary, and
any other convenient "class interval," as it is called in statistical
terminology, could have been used. Thus, if a 5-bushel class interval
had been selected, the convenient groups 19.5-24.4, 24.5-29.4,
29.5-34.4, and 34.5-39.4 bushels could have been established, giving
frequencies of 1, 9, 7, and 3 for the four groups. Just what class
interval makes the most satisfactory table for any given set of data
depends upon how the data run and how much detail it is desired
to show. Where practicable, class intervals of 10 or some fraction or
multiple of 10 are most convenient; the example just given shows
how much easier it is to comprehend the 5-bushel classes than the
3-bushel.³

2 In statistical terminology, the figure that divides the number of reports into
halves (29.5 in this case) is termed the median; and the figures that divide the
numbers into quarters (27.5 and 32.5) are termed the lower and upper quartiles.
The difference between the two quartiles, within which the central half of the
reports fall, is termed the interquartile range.
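The effect of choosing one class interval or another can be tried out directly. The sketch below is an illustration only; the function `group_frequencies` and its arguments are hypothetical names introduced for this example. It counts the reports falling in successive classes of equal width, beginning at a stated lower limit.

```python
def group_frequencies(values, lower_limit, class_interval, n_classes):
    """Count the values falling in successive classes of equal width.

    Class k covers lower_limit + k * class_interval up to, but not
    including, lower_limit + (k + 1) * class_interval.
    """
    counts = [0] * n_classes
    for v in values:
        index = int((v - lower_limit) // class_interval)
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} lies outside the classes provided")
        counts[index] += 1
    return counts


# Yields from Table 1, bushels per acre (farmers 1 to 20).
yields = [29, 25, 38, 30, 27, 33, 26, 28, 30, 29,
          29, 35, 26, 23, 31, 33, 31, 37, 28, 32]

three_bushel = group_frequencies(yields, 22.5, 3, 6)   # classes 22.5-25.4, 25.5-28.4, ...
five_bushel = group_frequencies(yields, 19.5, 5, 4)    # classes 19.5-24.4, 24.5-29.4, ...

print(three_bushel)   # [2, 5, 7, 3, 2, 1]
print(five_bushel)    # [1, 9, 7, 3]
```

Placing the class limits on half bushels, as the text does, means that no whole-bushel report can fall exactly on a boundary.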


Measures of Deviation 

The average deviation. Table 4 shows, in fairly compact form,
the way that the several individual reports fall on each side of the
average value. For some uses, however, it is desirable to have a single
figure which expresses the "scatteration" of the whole group of
reports, in just the same way that the arithmetic mean expresses the
average yield of the whole group.

One way in which the tendency of the group to scatter either far
from, or close to, the mean may be measured is by finding out how
far, on the average, each report lies from the mean. The following
tabulation illustrates the way in which this can be done:

TABLE 5

Computation of Average Deviation from the Mean

Original report      Mean      Report minus the mean
    Bushels         Bushels           Bushels
      29              30                -1
      25              30                -5
      38              30                 8
      30              30                 0
      27*             30                -3
    Total                               60†

* The remaining 15 reports are not shown in this table, though included in the total.
† The plus and minus signs are disregarded in making this total.

Average deviation = 60 bushels / 20 = 3 bushels

3 Where there is a tendency for the reports to be grouped around certain values,
such as 5, 10, it is desirable to take the class intervals so as to make these values
fall in the middle of the groups. Thus, with a concentration on even 5's and 10's,
the groups 2.5-7.4, 7.5-12.4, 12.5-17.4, etc., may be used.









In computing the average deviation, the plus and minus signs
are disregarded in adding up the individual differences from the
mean.⁴

The new figure, 3 bushels, is the average deviation of all the re- 
ports. It shows that the 20 individual reports differed from the mean 
yield of 30 bushels by an average of 3 bushels each. This furnishes 
a single figure which expresses how much or how little the individual 
yields differed from the average yield. If the group of 20 reports 
were being compared with another group of 20, all of 30 bushels 
each, the average deviations of the two sets would indicate at once 
the difference in their make-up, even though both sets had exactly 
the same average value of 30 bushels. The second set, with all the 
reports exactly equal to the average, would have an average deviation 
of 0, as compared to the 3-bushel average deviation for the first set. 
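The computation of the average deviation for all 20 reports may be sketched in Python as follows. The variable names are our own; the figures are those of Table 1. The sketch also shows why the signs must be disregarded, since the deviations taken with their signs add to zero.

```python
# Yields from Table 1, bushels per acre (farmers 1 to 20).
yields = [29, 25, 38, 30, 27, 33, 26, 28, 30, 29,
          29, 35, 26, 23, 31, 33, 31, 37, 28, 32]

n = len(yields)
mean = sum(yields) / n                        # arithmetic mean, 30.0 bushels

deviations = [x - mean for x in yields]       # report minus the mean
signed_total = sum(deviations)                # 0.0 when the signs are kept
absolute_total = sum(abs(d) for d in deviations)   # 60.0, signs disregarded

average_deviation = absolute_total / n        # 3.0 bushels

print(mean, signed_total, absolute_total, average_deviation)
```

For a second group of 20 reports, all of 30 bushels each, the same steps give an average deviation of 0.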

4 Before writing the general formula for the average deviation it is first
necessary to have some way of writing any deviation. Using X to indicate any given
report, as before, and $M_x$ to indicate the arithmetic average of all such reports, the
small x will be used to indicate the deviation of each report from the mean of all,
thus:

$$X - M_x = x \tag{2}$$

$$X' - M_x = x'$$

$$X'' - M_x = x''$$

and so on.

Similar to the previous usage, $\Sigma x$ (read "summation of all the small x's") is
used to indicate the sum of the values such as $x$, $x'$, $x''$, etc.

The average deviation, denoted by the sign $\delta$, is then defined by the following
equation:

$$\delta = \frac{\Sigma x \ \text{(taken without regard to sign)}}{n} \tag{3}$$

It is necessary to disregard the signs in taking this sum, as otherwise the sum
would be zero. If the signs were not disregarded, the values added would be as
follows:

For item 1, $x \ (= X - M_x)$
    item 2, $x' \ (= X' - M_x)$
    item 3, $x'' \ (= X'' - M_x)$

and so on to the last item

    item n, $x_n \ (= X_n - M_x)$

So when the deviations were summed,

$$\Sigma x = \Sigma X - nM_x$$

but

$$M_x = \frac{\Sigma X}{n}, \qquad nM_x = \Sigma X$$

hence

$$\Sigma x = \Sigma X - \Sigma X = 0$$





Whereas the arithmetic average is a measure of the central tendency
of a group of reports, the average deviation is instead a measure
of the "scatteration" of the individual reports, that is, of their
tendency to lie near to, or far from, the central value.

The standard deviation. How far a group of reports tends 
to scatter from the mean of the group may also be measured by an- 
other coefficient which has certain advantages from a mathematical 
point of view. This measure is based on the deviation of each report 
from the mean, just as is the average deviation. After the individual 
deviations are computed, each one is then squared. These squared 
values are added together to give the sum. This sum is then divided 
by the number of items, and the square root extracted of this average 
of the squared deviations. 


TABLE 6

Computation of Standard Deviation from the Mean

Original report      Mean      Report minus the mean      Deviations squared
                                   (= deviation)
    Bushels         Bushels           Bushels                  Bushels
      29              30                -1                        1
      25              30                -5                       25
      38              30                 8                       64
      30              30                 0                        0
      27*             30                -3                        9
    Total                                                       288

* The remaining 15 reports are not shown in this table, though included in the total.


The sum of the squared deviations, as shown in Table 6, is then
divided by the number of items included in the group, and the
square root of the result computed. The computation is as follows:

288 / 20 = 14.4

Standard deviation = √14.4 = 3.79 bushels⁵
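The same computation, carried through for all 20 reports, may be sketched in Python as follows. The variable names are our own. The standard library function `statistics.pstdev`, which also divides by the number of items, is shown only as a cross-check.

```python
import math
from statistics import pstdev

# Yields from Table 1, bushels per acre (farmers 1 to 20).
yields = [29, 25, 38, 30, 27, 33, 26, 28, 30, 29,
          29, 35, 26, 23, 31, 33, 31, 37, 28, 32]

n = len(yields)
mean = sum(yields) / n                                  # 30.0

squared_deviations = [(x - mean) ** 2 for x in yields]
sum_of_squares = sum(squared_deviations)                # 288.0, as in Table 6
mean_square = sum_of_squares / n                        # 14.4
standard_deviation = math.sqrt(mean_square)             # about 3.79 bushels

print(sum_of_squares, mean_square, round(standard_deviation, 2))
print(round(pstdev(yields), 2))                         # same value from the library
```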


5 The Greek letter $\sigma$ is used as the sign for the standard deviation. Using x to
represent individual differences from the mean, as before, $x^2$ for the square of each





The new value, 3.79 bushels, is called the standard deviation.⁶ (It
is sometimes called the root-mean-square deviation, because it is the
square root of the mean of the squares of the individual deviations.)
In comparison to the average deviation, which was found to be 3
bushels, it is somewhat larger. That relation holds generally: the
standard deviation is never smaller than the average deviation, because
the process of squaring the deviations emphasizes the largest
deviations more than does merely averaging them together. With
well-distributed observations, so that the distribution is "normal"
or nearly "normal," the standard deviation is about one and a quarter
times as large as the average deviation.⁷
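The ratio of the two measures for the corn yields can be compared with the ratio for an exactly normal distribution. The sketch below is illustrative; it assumes the standard result, not derived in this chapter, that for a normal distribution the standard deviation is the square root of pi/2 times the average deviation.

```python
import math

# Yields from Table 1, bushels per acre (farmers 1 to 20).
yields = [29, 25, 38, 30, 27, 33, 26, 28, 30, 29,
          29, 35, 26, 23, 31, 33, 31, 37, 28, 32]

n = len(yields)
mean = sum(yields) / n

average_deviation = sum(abs(x - mean) for x in yields) / n              # 3.0
standard_deviation = math.sqrt(sum((x - mean) ** 2 for x in yields) / n)  # about 3.79

ratio_observed = standard_deviation / average_deviation   # about 1.26 for these reports
ratio_normal = math.sqrt(math.pi / 2)                      # about 1.2533 for a normal curve

print(round(ratio_observed, 3), round(ratio_normal, 4))
```

Both figures are close to the "one and a quarter" mentioned in the text.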

of such deviations, and $\Sigma x^2$ for the sum of all such values, the standard deviation
is defined mathematically by the formula

$$\sigma_x = \sqrt{\frac{\Sigma x^2}{n}} \tag{4}$$

Where the arithmetic average is a fraction, so that computing each individual
deviation and squaring it would take much arithmetic for accurate work, the
standard deviation may be computed more easily by the following formula:

$$\sigma_x = \sqrt{\frac{\Sigma X^2}{n} - M_x^2} \tag{5}$$

Here the original X values are squared instead of the deviations from mean, or x,
values. It can be readily demonstrated algebraically that the two formulas give
identical values for $\sigma_x$.

Thus each

$$x = X - M_x$$

each

$$x^2 = X^2 - 2XM_x + M_x^2$$

hence

$$\Sigma x^2 = \Sigma X^2 - 2\Sigma XM_x + \Sigma M_x^2$$

But

$$\Sigma X = nM_x$$

and

$$\Sigma M_x^2 = nM_x^2$$

hence

$$\Sigma x^2 = \Sigma X^2 - 2nM_x^2 + nM_x^2$$

and

$$\Sigma x^2 = \Sigma X^2 - nM_x^2$$

Dividing both sides by n gives $\Sigma x^2 / n = \Sigma X^2 / n - M_x^2$, so that formulas (4)
and (5) agree.

6 For a shorter method of computing the standard deviation, when there is a
large number of observations, see Note 1 at the end of this chapter.

7 A "normal distribution" is such a one as will be obtained from a series of
observations of a variable influenced only by a large number of random or chance
causes, each one small in proportion to the total. Thus the values secured by
tossing a number of dice, and noting the spots at each reading, tend to conform
to a "normal curve." Variables composed of a large number of small, independent
elements also tend to have a normal distribution. Since this distribution can be
studied mathematically, it is possible to work out theoretically many of its
properties. These theoretical characteristics of the normal curve are valuable in
studying data where the distributions are nearly normal.
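That squaring the original values leads to the same standard deviation as squaring the deviations can be verified numerically. The following Python sketch, with variable names of our own choosing, computes the standard deviation of the 20 corn yields both ways.

```python
import math

# Yields from Table 1, bushels per acre (farmers 1 to 20).
yields = [29, 25, 38, 30, 27, 33, 26, 28, 30, 29,
          29, 35, 26, 23, 31, 33, 31, 37, 28, 32]

n = len(yields)
mean = sum(yields) / n                                   # 30.0

# From the deviations: square root of (sum of squared deviations) / n.
sum_squared_deviations = sum((x - mean) ** 2 for x in yields)   # 288.0
sigma_from_deviations = math.sqrt(sum_squared_deviations / n)

# From the original values: square root of (sum of squared values) / n, less the mean squared.
sum_squared_values = sum(x * x for x in yields)                 # 18288
sigma_from_values = math.sqrt(sum_squared_values / n - mean ** 2)

print(round(sigma_from_deviations, 4), round(sigma_from_values, 4))
```

The two results agree except for rounding in the machine arithmetic. The second form saves labor chiefly when the mean is a fraction, so that each deviation would otherwise have to be carried to several decimals.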





The distribution of the observations shown in Table 3 is fairly
regular. Most of the reports come at about the middle values and then
thin out to both ends (that is, the distribution approximates normality).
In such cases the standard deviation gives a measure of the range
within which a definite proportion of the cases will be included.
Specifically, if we take the range from the distance of the standard
deviation below the mean to the distance of the standard deviation
above the mean, about 68 per cent of the records will be included.
In this particular case the mean is 30.00 bushels, and the standard
deviation is 3.79 bushels, so the range will be from 3.79 less than
30.00, or 26.21, to 3.79 more than 30.00, or 33.79. Comparing this with
Table 3, we find that 13 farmers reported yields between 26.5 and
33.4 bushels, whereas 4 reported 26.4 or less, and 3 reported 33.5 or
more. The range 26.5 to 33.4 thus included 13 out of the 20 cases, or 65
per cent. With only 20 observations, this is as close as could be
anticipated to the 68 per cent which would be expected for the range
26.21 to 33.79 if the distribution of the data were normal.
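The count of reports lying within one standard deviation of the mean can be sketched in Python as follows. The variable names are our own. The theoretical proportion for a normal distribution is obtained here from the error function, `math.erf`, which is an assumption of this sketch rather than a method used in the text.

```python
import math

# Yields from Table 1, bushels per acre (farmers 1 to 20).
yields = [29, 25, 38, 30, 27, 33, 26, 28, 30, 29,
          29, 35, 26, 23, 31, 33, 31, 37, 28, 32]

n = len(yields)
mean = sum(yields) / n
sigma = math.sqrt(sum((x - mean) ** 2 for x in yields) / n)

lower, upper = mean - sigma, mean + sigma        # about 26.21 and 33.79
inside = [x for x in yields if lower <= x <= upper]
share_inside = len(inside) / n                   # 13 of 20, or 0.65

# Proportion of a normal distribution within one standard deviation of its mean.
share_normal = math.erf(1 / math.sqrt(2))        # about 0.6827

print(round(lower, 2), round(upper, 2), len(inside), share_inside)
print(round(share_normal, 4))
```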

For some uses, the square of the standard deviation has advantages
over the standard deviation itself. Just as the standard deviation,
3.79 bushels in this case, may be thought of as measuring
"variability," so the standard deviation squared, 14.4, may be thought of
as measuring "average squared variability." The term "variance"
has been suggested by R. A. Fisher, an eminent English statistician,
to designate this squared variability, and that term will be used
hereafter in this book when the standard deviation squared is to be
referred to.

The relation of the three measures which have been discussed in
this chapter (the mean, the average deviation, and the standard
deviation) is illustrated graphically in Figure 1. Here the frequency
distribution shown in Table 4 has been charted, showing the yield in
bushels of corn along the bottom of the chart, and the number of
reports falling in each group along the sides.⁸

8 Mathematically, the quantities which are measured from left to right, and
shown along the bottom of the chart, as the bushels of corn are here, are called the
"abscissas," whereas the quantities which are measured from bottom to top, and
shown along the sides as the number of reports are here, are called the "ordinates."
Since any point in the whole chart can be located by telling how far it is from the
left side, and how high it is from the bottom, these two items tell exactly where
any particular point in the figure should fall. Thus the line for the group from
28.5 to 31.5 bushels has for ordinate the height 7 farms, and the abscissas of the
ends of the line are 28.5 and 31.5 bushels. The ordinate and abscissa, taken
together, are called the "coordinates" of a point.





Besides showing the number of reports included in each 3-bushel
group by the height of the continuous line, the position of the mean
in about the center of the group of reports is indicated, and likewise
the number of reports included within a range of both one average
deviation and of one standard deviation on each side of the mean.

[Figure 1 appears here. Vertical axis: number of reports. Horizontal axis: yield of corn, bushels per acre.]

Fig. 1. Frequency distribution of corn yields, and range above and below the
mean included by average and standard deviations.

Summary. This chapter has shown (1) how a series of measurements
of any one variable, such as the yield of corn from farm to
farm, may be classified into a frequency distribution which shows
how the individual reports are distributed from high to low; (2) how
an arithmetic average may be computed which shows the value around
which all the reports center; and (3) how the variation of the
individual reports from the average may be summarized by computing
the average deviation or the standard deviation, either one of which
serves as an indication of the variability of the items included in
the particular series. Although these statistical constants, especially
the arithmetic average, are frequently of value for themselves alone,
they are discussed here because it is necessary to know how they are
computed and what they mean before the next propositions to be
discussed can be fully understood.






Note 1, Chapter 1. Where the number of observations is large, the standard
deviation may be computed more readily from a grouped frequency table than
from the individual items. This process is illustrated in the following tabulation.

    Yield         Number of     Deviation from          Extensions
                   reports       assumed mean
                     (F)             (d)              dF        d²F
22.5 to 25.5          2              -2               -4         8
25.5 to 28.5          5              -1               -5         5
28.5 to 31.5          7               0                0         0
31.5 to 34.5          3              +1                3         3
34.5 to 37.5          2              +2                4         8
37.5 to 40.5          1              +3                3         9
Sums                 20                               +1        33

The standard deviation is then calculated from the grouped data by the formula

$$\sigma_u = \sqrt{\frac{\Sigma(d^2F)}{n} - \left[\frac{\Sigma(dF)}{n}\right]^2 - \frac{c^2}{12}} \tag{6}$$

Substituting the values shown in the tabulation

$$\sigma_u = \sqrt{\frac{33}{20} - (0.05)^2 - 0.0833} = 1.25$$
In making this computation, any convenient group may be selected as the
assumed mean, and the deviations of the other groups (d) calculated as departures
from it. This method assumes that all the cases in each group fall at the center of
the group. With most variables, with a tendency toward a normal distribution, the
average of the items in each group will fall somewhat nearer the center of the
distribution than the midpoint of the group, so the use of this method tends to give too
large a value for the standard deviation. The correction $-c^2/12$, called "Sheppard's
correction" after its originator, makes an approximate allowance for this tendency.
The c of the formula stands for the number of units of d in each class interval.
Where a unit of 1 is used for each class interval, as in this problem, the correction
becomes simply $-1/12$, to be applied to $\sigma_u^2$.

In computing the standard deviation from a grouped frequency table, the $\sigma$
calculated will be in terms of the units in which d is expressed. In the illustration, each
unit in d (one class interval) represents 3 units in X, since the yields were grouped
in 3-bushel classes. The standard deviation computed in terms of class intervals,
$\sigma_u$, is therefore only one-third as large as is the standard deviation in terms of X.





The latter may be calculated from the former by multiplying by the number of
units in each group. That is,

$$\sigma_x = (\text{units of } X \text{ per class interval}) \; \sigma_u$$

In this problem

$$\sigma_x = 3\,(1.25) = 3.75$$

The resulting value, 3.75, found by the short-cut method, is seen to be almost the
same as the exact value of 3.79 bushels, previously found by the longer method. The
greater the number of cases, and the more nearly normal the distribution, the more
time will the short-cut method save, and the more nearly will its approximate result
agree with the exact value found by the longer method.
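The short-cut computation of this note may be sketched in Python as follows. The variable names are our own; the frequencies and deviations are those of the tabulation above, with the group 28.5 to 31.5 taken as the assumed mean and one class interval as the unit of d (so that c = 1).

```python
import math

# Grouped frequency table of the corn yields, 3-bushel classes.
frequencies = [2, 5, 7, 3, 2, 1]        # F, number of reports in each group
d_values = [-2, -1, 0, 1, 2, 3]         # d, deviation from the assumed mean, in class intervals
class_width = 3                         # bushels of X per class interval
c = 1                                   # units of d per class interval

n = sum(frequencies)                                               # 20
sum_dF = sum(d * f for d, f in zip(d_values, frequencies))         # +1
sum_d2F = sum(d * d * f for d, f in zip(d_values, frequencies))    # 33

# Standard deviation in class-interval units, with Sheppard's correction.
sigma_u = math.sqrt(sum_d2F / n - (sum_dF / n) ** 2 - c ** 2 / 12)

# Converted to bushels.
sigma_x = class_width * sigma_u

print(n, sum_dF, sum_d2F)
print(round(sigma_u, 2), round(sigma_x, 2))   # 1.25 3.75
```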



CHAPTER 2 


JUDGING THE RELIABILITY OF STATISTICAL RESULTS 

Almost without exception, the object of a statistical study is to 
furnish a basis for generalization. In a case like that discussed in 
the preceding chapter, for example, no one would be likely to visit 20 
farms scattered all over a county simply for the purpose of finding 
out what the yield of corn was on those particular farms. Instead, 
he might be studying the yield on those farms as a basis for determin- 
ing what the average yield of corn was for all the farms in the county. 
Stated in statistical terms, he would be finding out what was the 
average yield in a sample of farms, picked at random, with a view to 
determining what was about the average yield in the universe in 
which he was interested, that is, on all the farms in the county.¹

Of course it would be possible to visit all the farmers in the county, 
find out exactly what yield each one obtained, and so get an average 
of all the yields in the whole county. But this process would not 
only be expensive but also in most cases would be a pure waste of time
and energy. We need only take a large enough sample by a well- 
designed sampling method to satisfy ourselves to any desired degree 
of accuracy concerning the actual average for all the farms of the 
county. In this case, 100 records may enable one to determine the 
average yield quite as accurately as is necessary. Obtaining records 
from all the several thousand farmers in the county might add nothing 
to the significance of the results. 

Before considering ways of finding out how many records would
be needed in any given case, we might well discuss a little more
fully what the process of statistical inference involves. Really, all
that we do is to examine or measure a certain group of objects, and
infer from the size or measurement of those objects, or from the way
those objects behave, what will be the size of other objects of the

1 These two terms, "universe," meaning the whole group of cases about which
one is interested in finding out certain facts, and "sample," meaning a certain
number of those cases, picked at random or otherwise from all those in the
particular universe, are both used frequently in statistical work, and should be clearly
understood.




same sort, or how other objects of the same kind will behave. This 
process is also called induction, because from particular facts about 
particular objects we lead out (induct) general conclusions as to what
will be the facts for all such objects in general. Now of course we do not 
really know what the particular facts are for any particular object 
without actually examining that individual object. All that we can do 
is to separate off certain groups of objects which we know to be alike in 
one or more particulars, and then assume that they will be alike 
in other particulars too, even though we do not examine every one 
to prove it. In the case of our farms, all that we know about them 
is that they are in the same county. Now because they are in the 
same county, we may expect that the temperature will be about the 
same, the rainfall will be similar, and the growing season will prob- 
ably not be much different from farm to farm. We may also expect 
that the kind of soil will not be very greatly different from farm 
to farm, and that the fertility will be somewhere near the same. 
Finally, we may expect that the fields are equally well drained on the 
farms within the county.² But these expectations are not necessarily
matters of known fact; we may expect that they are so from our general
knowledge of the particular situation and of other similar situations.
If the conditions agree with our expectations, generalizations
from the facts of our sample to the facts of the universe as a whole 
may be correct; if conditions do not agree with expectations, then 
our general conclusions may be incorrect. In either case it is not 
merely a matter of statistical technique but also of prior or additional 
knowledge of the subject. All that the statistical technique can do 
is to provide us with an average (or other measure or description of our 
facts) and a statement of how much confidence we can place in that 
average under certain given assumptions. Those assumptions may not 
be correct in any given case, and then our conclusion will be incor- 
rect also; but that is not the fault of the statistics, but of the statis- 
tician; not of the facts, but of the use to which we try to put them. 

2 Obviously, these things would not be true in many sections. In hilly or
mountainous areas temperature, rainfall, and length of growing season may differ very
greatly within short distances, whereas in other regions, such as the Coastal Plains
areas, the soils may be so varied that very fertile and very infertile soils are
jumbled together in a veritable crazy-quilt.

Assumptions in sampling. The basic assumptions upon which the
theory of sampling rests apply both to the way in which the sample
is obtained and to the material which is being sampled. With respect
to the material sampled, the assumption is that there is a large
"universe" of uniform conditions, in that throughout the universe the
individual items vary among themselves in response to the same
causes and with about the same variability. With respect to the
selection of the sample, the values must be so selected (a) that there
will not be any relation between the size of successive observations,
that is, that the chances of a high observation being followed by
another high observation will be just the same as of a low or a medium
observation being followed by a high observation; (b) that the
successive items in the sample are not definitely selected from different
portions of the universe in regular order, but are simply picked at
random so that the chance of the occurrence of any particular value
is the same with each successive observation in the sample; and (c)
that the sample is not picked all from one portion of the universe,
but that the observations are scattered through the universe by purely
chance selection.³ Where these assumptions are fulfilled, the sample is
designated a "random sample," and its reliability may be estimated by
the methods now to be described.

Taking up the question of how reliable a statistical average really
is, we must first consider, "What is the meaning of reliable?" If we
are interested in corn yield, for example, it is obvious that a perfectly
reliable sample would be one whose average agreed exactly with the
average yield in the county. But if we are interested in knowing
the average yield to within one bushel, then for that purpose the sample
would be sufficiently reliable if its average came within one bushel
of the average for the whole county.

Variations in successive samples. Suppose that 20 farms had been
visited at random, with the results already presented. If we wanted
to find out how near we could expect the average from that sample to
come to the average for the county as a whole, we might try taking
another sample, visiting 20 other farms at random, and getting the
average yield for those 20. If the average yield of the second sample
differed from the average of the first sample by, say, 3 bushels, we
should know that both could not come within one bushel of the true
average; if, however, the average of the second sample came within a
half bushel of the first average, we should be inclined to place more
confidence in it. If we repeated the process several times over, and
all the different samples had averages falling within one bushel of
each other, say between 29.0 and 30.0 bushels, then we should feel
pretty certain that the average yield for the county as a whole was
29.5 bushels, or very close to it.

3 Where the items are so selected as to represent different portions of the
universe, it may be called a "stratified sample"; where they are all selected from one
portion of the universe, it may be called a "spot" sample.

Where the universe is not completely uniform, a "stratified" sample tends to be
more reliable than a random sample, while a "spot" sample tends to be less reliable
than a random sample. See G. U. Yule, Introduction to the Theory of Statistics,
pp. 347 to 349 of sixth edition, for formulas as to the reliability of stratified and
spot samples.

Let us suppose that 15 more samples had been made, each from 20 
farms selected at random, and that when we tabulate the 16 averages 
from the 16 different samples, we have the following 16 values: 

TABLE 7

Average Yield of Corn in One County, as Determined by 16 Different
Samples of 20 Farms Each

Sample         Yield             Sample         Yield
          Bushels per acre                 Bushels per acre
   1           30.0                 9           30.3
   2           27.5                10           28.9
   3           29.3                11           29.3
   4           30.6                12           28.0
   5           29.8                13           29.2
   6           31.1                14           30.9
   7           28.3                15           29.1
   8           29.6                16           30.4


Although the 16 averages range all the way from 27.5 bushels
for the smallest to 31.1 bushels for the largest, we can see that most
of them fall around 29 or 30 bushels. This is even more evident when
we arrange the 16 reports in a frequency table as shown in Table 8.

Although there is some tendency for the averages to cluster around
29 and 30 bushels, still there are several below 28.5 and several above
30.5. The average for the whole group is 29.5 bushels, and the standard
deviation is 0.99 bushel, or, for practical purposes, 1 bushel.

The fact that the standard deviation of the group of averages is
1 bushel tells us one thing about the way they scatter, from what we
already know about the meaning of standard deviation. It tells us
that about 68 per cent of them will fall in the range between one
standard deviation below the mean of all the averages and one standard
deviation above the mean. In this particular case, the mean is 29.5
bushels, and the standard deviation is approximately 1 bushel, so the
range of one standard deviation above and below the mean includes
approximately 28.5 bushels to 30.5 bushels. Checking this against
the array of averages shown in Table 8, we find that this range does
include 10 out of the 16 cases, or close to the proportion expected.
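The mean and standard deviation of the 16 sample averages, and the count of those falling within one standard deviation of their mean, can be checked with a short Python sketch. The variable names are our own; the averages are those of Table 7, with sample 8 taken as 29.6 bushels.

```python
import math

# Average yields from the 16 samples of Table 7, bushels per acre (sample 8 = 29.6).
sample_means = [30.0, 27.5, 29.3, 30.6, 29.8, 31.1, 28.3, 29.6,
                30.3, 28.9, 29.3, 28.0, 29.2, 30.9, 29.1, 30.4]

k = len(sample_means)
mean_of_means = sum(sample_means) / k          # about 29.5 bushels

sd_of_means = math.sqrt(
    sum((m - mean_of_means) ** 2 for m in sample_means) / k
)                                              # about 1 bushel

lower, upper = mean_of_means - sd_of_means, mean_of_means + sd_of_means
within = [m for m in sample_means if lower <= m <= upper]

print(round(mean_of_means, 1), round(sd_of_means, 2))
print(len(within), "of", k)                    # 10 of 16
```

Ten out of 16 is 62.5 per cent, reasonably near the 68 per cent expected with so few averages.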

TABLE 8

Frequency Table Showing the Number of Times Various Average Yields
Were Obtained out of 16 Samples, by One-half Bushel Groups

Yield of corn    Number of averages    Yield of corn    Number of averages
   Bushels            in group            Bushels            in group
  27.5-27.9              1               29.5-29.9              2
  28.0-28.4              2               30.0-30.4              3
  28.5-28.9              1               30.5-30.9              2
  29.0-29.4              4               31.0-31.4              1


Now let us go back to our single original average of 30 bushels, based 
on visits to the original 20 farms. What we want to know is how reli- 
able that one average is. Stated another way, how much is that average 
likely to be changed if the study were made over again — if another 
sample of the same size were taken? 

In Tables 7 and 8 we have seen how it might actually work out
if we did do the study over several times. We have seen that, in
case the new averages did fall as shown in those tables, two-thirds of
the new averages would fall within a range of 2 bushels. Furthermore,
those figures showed that all the different averages fell within a
range of 4 bushels (27.5 to 31.5). But those conclusions were obtained
only after getting 15 more samples of 20 cases each, and making 15
new averages, one for each sample. Is there any way to find out how
much the single original average is likely to vary from the true average
without going to all the work of taking a number of new samples?


Estimating the Reliability of a Sample 

If we could estimate the extent to which the averages from new 
samples would be likely to vary, without ever getting the new samples,
then we should know something more about how much faith we could 
put in the particular average which we had already. For example, 
if in the present case we knew that, if we did go out and get a large 
number of new averages (such as those shown in Tables 7 and 8), 





those new averages would have a standard deviation of 1 bushel, this 
fact would tell us at once something about how much our one average 
was likely to be different from the real average on all the farms. For 
example, we should know that about 68 per cent of the averages would 
lie in a range of 2 bushels (one standard deviation on each side of the 
mean of the samples). The one particular average which we had 
obtained might be any one of all those in a distribution like that shown 
in Table 8. If we assume that the mean of all the samples would 
coincide with the true average, then, as we have just seen, the chances 
would be about 68 out of 100 that our average was one of the averages 
falling within one bushel of the true mean. If on the other hand we 
knew that the standard deviation of a group of new averages would 
probably be, say, 5 bushels, then we should know that we only had 
about 2 chances out of 3 of the mean of any one sample coming within 
five bushels of the true average. Obviously, when an average has 2 
chances out of 3 of coming within one bushel of the true average it is 
much more reliable than if it had 2 chances out of 3 of coming within
five bushels of the true average. 

Whether we can judge how reliable a given average really is 
depends, therefore, on whether we can tell what would be the standard 
deviation of a number of similar averages, computed from random 
samples of the same number of items drawn from the same universe. 
If we could tell exactly what that standard deviation would be, we 
should know how much faith we could put in the average we had — 
we should know what the chances were of its being changed if the 
study were made over. Even if we did not know exactly what the 
standard deviation of the whole group of similar averages would be it 
would be some help if we knew approximately what it would be, or if 
we had a minimum or maximum value for its size, so that there would 
be some measure of how much trust to place in the particular average. 

Computing the standard error. Fortunately, it is possible to esti- 
mate with some degree of accuracy what the standard deviation of a 
whole series of averages is likely to be, if each average is computed from 
a sample of the same size and drawn from the same universe.⁴ Except
under the exact assumed conditions, which are seldom completely obtained
in practice, this estimate is not necessarily the best that could be
made. Even so, the ability to make a rough estimate is a tremendous
aid to statistical investigators, for it affords some check on the
dependability of results, without going to the expense that would be

⁴ Note 1 of Appendix 2 gives the derivation of this formula and shows the
specific assumptions on which it is based.





involved in repeating every sample 15, 20, or more times, to make sure 
that a reliable result had been obtained. 

The method for computing the estimated standard deviation of the 
average involves just two values. These are (1) the standard devia- 
tion of the items in the universe from which the sample was drawn 
and (2) the number of items in the sample. We do not know the 
standard deviation of the items in the universe, however, and can only 
estimate it from the standard deviation of the items in the sample. 
It has been determined that an unbiased estimate of the standard 
deviation in the universe can be made by adjusting the standard 
deviation observed in the sample as follows:⁵

$$\text{Estimated stand. dev. of the universe} = (\text{observed stand. dev. in the sample}) \sqrt{\frac{n}{n-1}}$$

In this case

$$= 3.79 \sqrt{\frac{20}{19}} = (3.79)(1.026) = 3.89$$
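The adjustment just described is simple enough to sketch in code. The function name `estimated_universe_sd` is an illustrative choice, not something from the text; the figures 3.79 and 20 are those of the corn-yield sample.

```python
import math


def estimated_universe_sd(sample_sd, n):
    """Adjust the standard deviation observed in a sample of n items
    by the factor sqrt(n / (n - 1))."""
    if n < 2:
        raise ValueError("at least two observations are needed")
    return sample_sd * math.sqrt(n / (n - 1))


factor = math.sqrt(20 / 19)            # adjustment factor for 20 cases
estimate = estimated_universe_sd(3.79, 20)
print(round(factor, 3), round(estimate, 2))
```

The factor shrinks toward 1 as the sample grows, so the adjustment matters chiefly for small samples.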

The standard deviation of the group of averages may next be 
estimated by dividing the estimated standard deviation in the uni- 



⁵ Using the symbol $\sigma$ as before to mean the standard deviation observed in the
sample, and $\bar{\sigma}$ to represent the estimated standard deviation in the universe from
which the sample was drawn, we can define the estimated value as

$$\bar{\sigma} = \sigma \sqrt{\frac{n}{n-1}} \tag{6.1}$$

It may more readily be computed by the equation

$$\bar{\sigma} = \sqrt{\frac{\Sigma x^2}{n-1}} \tag{6.2}$$

The two equations are identical, as may readily be proved by combining equations
(4) and (6.1).

When equation (5) is used, $\bar{\sigma}$ may be computed

$$\bar{\sigma} = \sqrt{\frac{\Sigma X^2 - nM_x^2}{n-1}} \tag{6.3}$$





verse by the square root of the number of cases in the sample. Thus, 
for our original sample of 20 farms,⁶

$$\text{Standard error of the average} = \frac{\text{estimated standard deviation of items in the universe}}{\text{square root of the number of cases in the sample}}$$

$$= \frac{3.89\ \text{bushels}}{\sqrt{20}} = \frac{3.89\ \text{bushels}}{4.47} = 0.87\ \text{bushel}$$
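As a minimal sketch of this division, the snippet below recomputes the standard error for the 20-farm sample. The function name `standard_error_of_mean` and the comparison sample of 80 cases are assumptions added for illustration.

```python
import math


def standard_error_of_mean(universe_sd, n):
    """Estimated standard deviation of the universe divided by sqrt(n)."""
    if n < 1:
        raise ValueError("the sample must contain at least one case")
    return universe_sd / math.sqrt(n)


se_20 = standard_error_of_mean(3.89, 20)   # the sample discussed in the text
se_80 = standard_error_of_mean(3.89, 80)   # hypothetical sample four times as large
print(round(se_20, 2), round(se_80, 2))
```

Because the number of cases enters through its square root, quadrupling the sample only halves the standard error.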

In comparison with the 15 other averages, all shown in Table 7, we 
see that in this case the standard deviation of all the averages was a 
trifle larger than we estimated it was likely to be — 0.99 bushel, as 
compared to 0.87 bushel expected. It has already been noted that 
where a number of repeated samples are actually taken, this may
easily occur. In practice, sampling rarely fulfills all the conditions 
on which the mathematical formula is based, and for that reason an 
average may be either less or more accurate than the estimated 

⁶ Here the symbol $\sigma$ denotes the standard deviation as before, the subscript x
indicates that it is the standard deviation of the individual items that go to make
up our sample, and the subscript M indicates that it is the standard deviation of the
means which is to be computed, thus:

$\bar{\sigma}_x$ = standard deviation of the items in the universe, estimated by equations
(6.1), (6.2), or (6.3).

$\sigma_M$ = estimated standard deviation of the group of averages if similar samples
were repeated = standard error of the mean of X.

The standard error of the mean is then given by the formula

$$\sigma_M = \frac{\bar{\sigma}_x}{\sqrt{n}} \tag{7.1}$$

Here, just as in the previous formulas, n stands for the number of items in the
original sample, the same items as those from which $\bar{\sigma}_x$ was computed.

In some statistical textbooks, a different notation is followed from that used
here. In those books the Greek letters are used to represent the true values existing
in the universe, whereas corresponding Latin letters represent the values for the
same constants as determined from a particular sample. In this notation $\sigma_x$ would
mean the true standard deviation in the universe, whereas $s_x$ would mean the standard
deviation observed in a sample. This use is referred to here for the information
of students who may have occasion to refer to other textbooks using this other
notation.






standard deviation indicates that it is likely to be. Even so, this 
estimated "standard deviation of similar averages" is an exceedingly
useful figure. Such an estimated standard deviation for an average
(or any other statistical measure) is called the standard error of that
average (or other statistical measure). It serves as a standard measure
to give warning of about how much that sample may give results 
which vary from the true facts of the universe, solely as the result of 
chance fluctuations in sampling. It gives some indication of how 
much confidence can be placed in the measures computed from a 
sample. 

Reliability of small samples. Where there are only a small num- 
ber of observations in the sample, the standard deviation of the 
averages from a series of such samples tends to be somewhat larger 
than the standard deviation estimated by means of equation (7.1), 
and the distribution of the averages from such small samples tends to 
be somewhat different from that for large samples. If there are 30 
or more observations in the sample, the difference is so small that it 
may be disregarded. The farther the number of observations falls 
below 30, the more serious the difference. A correction has therefore 
been worked out, by higher mathematics, to allow for this error in the 
estimated standard deviation where there are less than 30 observations. 
This correction is shown by comparing the difference between the sample 
mean and the true mean of the universe with the estimated standard 
error of the mean, and by indicating in what proportion of repeated 
samples of the same size this ratio will exceed given values. These 
proportions are shown in Table A and in Figure A on page 505.⁷

The table shows the proportion of the trials in which a sample of 
each given size will have an average which differs from the true 
average by more than the specified range. Thus, if there are a large 

⁷ Table A applies as stated only in the case of measures such as the arithmetic
average, which are computed from the original data by the determination of a
single constant. Where the computation of the statistical measure involves
simultaneously determining two constants from the original data, n - 1 should be used
for the "number of observations in the sample." This applies to the coefficient of
regression. Where the computation of the statistical measure involves
simultaneously determining a large number of constants, say j in number, from the
original data, then (n - j + 1) should be used for the "number of observations" in
entering Table A or Figure A. Thus for a coefficient of partial regression, $b_{12.345}$,
obtained from a sample of 20 observations, 5 constants are involved, so 16 would
be used as the "number of observations" in using Table A to judge the reliability
of the computed value. (Subsequent chapters will explain the meaning of the new
coefficients mentioned here.)





number of observations in the sample, and we state that the true 
average lies within one standard error of the computed average, we 
should be wrong for 3 out of 10 such statements. (The exact propor- 
tion expected is 317 out of 1,000.) If there were 20 observations in 

TABLE A

Proportion of Repeated Samples in Which the Ratio of the Error in the
Mean to the Estimated Standard Error of the Mean Exceeds the Value
Specified in the Left-Hand Column, for Various Sizes of Sample*

Ratio of the error in
the mean to the                      Size of sample (n)
estimated standard
error of the mean       2       4       6      10      16      20    30 or more

      0              1.0000  1.0000  1.0000  1.0000  1.0000  1.0000    1.0000
       .50            .7048   .6514   .6382   .6290   .6244   .6228     .6171
      1.00            .5000   .3910   .3632   .3434   .3332   .3298     .3173
      1.50            .3744   .2306   .1940   .1678   .1544   .1500     .1336
      2.00            .2952   .1394   .1020   .0766   .0640   .0600     .0455
      2.50            .2422   .0878   .0544   .0338   .0246   .0218     .0124
      3.00            .2048   .0576   .0300   .0150   .0090   .0074     .0027
      3.50            .1772   .0394   .0172   .0068   .0032   .0024     .0005
      4.00            .1560   .0280   .0104   .0032   .0012   .0008
      4.50            .1392   .0204   .0064   .0014
      5.00            .1256   .0154   .0042   .0008

Based on article by "Student," New tables for testing the significance of observations, Metron
V, No. 3, 105-120, 1925.

* See Figure A, Appendix 3, for full set of values.


the samples, and we made the same statement, we should be wrong 
33 times out of 100. For samples with only 2 observations, such a 
statement would be wrong 50 times out of 100, on the average. 
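The proportions quoted from Table A can be approximated numerically. The sketch below rests on an assumption not spelled out in the text: that the entries are two-sided probabilities of Student's t distribution with n - 1 degrees of freedom, which is what the table's source note suggests. The function names and the use of Simpson's rule are illustrative choices, not the method by which the table was compiled.

```python
import math


def t_density(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def proportion_exceeding(ratio, n, steps=2000):
    """Probability that |t| exceeds `ratio` for a sample of n cases,
    assuming n - 1 degrees of freedom; `steps` must be even."""
    if ratio == 0:
        return 1.0
    df = n - 1
    h = ratio / steps
    # Simpson's rule for the integral of the density from 0 to ratio.
    total = t_density(0.0, df) + t_density(ratio, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    inside = 2 * total * h / 3      # both sides of zero
    return 1 - inside


for n in (2, 4, 20):
    print(n, [round(proportion_exceeding(r, n), 4) for r in (1.0, 2.0, 3.0)])
```

Under that assumption the computed values agree with the tabled ones to about the third decimal place; small differences in the last digit are to be expected from rounding in the printed table.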

The estimated standard error of 0.87 bushel from our single sample 
of 20 cases, with an average of 30.0 bushels, would therefore tell 
us that 67 per cent of such samples would have averages which fell 
within a range of 0.87 bushel of the true mean. If our sample is a 
true random sample, we should then have 2 chances out of 3 of being 
right if we estimated that the real average yield for all the farms 
in the county, the year the sample was taken, was within 0.87 bushel 
of the average shown by the sample. 

It is important to keep in mind that the probabilities shown in 
Table A refer to the ratio between the error in the mean and the 
estimated standard error of that mean, and not to the error itself. 





The size of the ratio will depend both upon the size of the error and 
the size of the estimated standard error. At times the ratio may be 
very large, even when the error in the mean is small, merely because 
the sample happened to be one that showed an exceptionally small 
standard deviation. Conversely, the ratio will at times be small, 
not because the error in the mean is small but because the sample
happened to be one that showed an exceptionally large standard 
deviation. For this reason it is well to be cautious in interpreting 
the average from a very small sample, even though that sample seems 
to be very reliable, as judged by the size of its estimated standard 
error and by the probabilities of various departures from the true 
mean, as read from Table A. This brings up the subject of the 
standard error of the standard error, which is treated in the next para- 
graph. 

Standard error of the standard error. A small sample (say of 
30 cases or less) cannot serve as a satisfactory guide to the facts of 
the universe, even with the aid of Table A. With a small sample, 
not only do we not know the true value of the mean, but also we do 
not know the true value of the standard deviation from which we
estimate the standard error of the mean. Our estimate of the standard
error of the mean is itself subject to error. With very small
samples, say of 5 to 10 cases, this introduces a degree of unreliability
which no amount of calculation can fully correct. The results are
uncertain within wide limits, and only a larger sample, or several 
successive small samples, can reduce that uncertainty. 

The standard error of the standard error, stated in relative terms, 
depends solely upon the number of cases in the sample. It is computed
as follows:

Relative standard error of the standard error⁸

$$= \frac{1}{\sqrt{2\,(\text{number of cases in sample} - 1)}}$$

⁸ Using $\sigma_{\sigma_M}$ to represent the relative standard error of the estimated standard
error, we may define it

$$\sigma_{\sigma_M} = \frac{1}{\sqrt{2(n-1)}} \tag{7.2}$$

A slightly more accurate estimate can be made by use of the equation 

1 

<^iTM = /- 

V n(w — 1) 

The differences between the two equations are, however, negligible. See W.
Edwards Deming and Raymond T. Birge, On the statistical theory of errors, Reviews
of Modern Physics, pp. 119-161, Vol. 6, July 1934.





For our sample of 20 cases 

$$\sigma_{\sigma_M} = \frac{1}{\sqrt{2(20-1)}} = 0.162$$

The standard error of the standard error, for the sample sizes
shown in Table A, is given in Table B. 


TABLE B*

Relative Standard Error of the Estimated Standard Error of the Mean,
for Varying Sizes of Sample

Size of sample    Relative standard error†

       2                  0.707
       4                  0.408
       6                  0.316
      10                  0.236
      16                  0.183
      20                  0.162

* Footnote 7, on page 22, applies to Table B as well.
† Stated as a proportion of the estimated standard error.


Table B illustrates how, with very small samples, even our estimate 
of the standard error of the average is subject to a wide zone of
uncertainty. With 4 cases, its own standard error is 41 per cent of the
value computed. 
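The entries of Table B follow directly from equation (7.2), as the short sketch below suggests. The function name `relative_se_of_se` is introduced here for illustration only.

```python
import math


def relative_se_of_se(n):
    """Equation (7.2): 1 / sqrt(2 (n - 1)), as a proportion of the standard error."""
    if n < 2:
        raise ValueError("at least two cases are needed")
    return 1 / math.sqrt(2 * (n - 1))


sizes = (2, 4, 6, 10, 16, 20)
table_b = [round(relative_se_of_se(n), 3) for n in sizes]
for n, value in zip(sizes, table_b):
    print(n, value)
```

The proportion falls only slowly as the sample grows, which is why even a sample of 20 cases leaves its own standard error uncertain by about a sixth of its value.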


Meaning and Use of the Standard Error 

It is good statistical practice, whenever an average is cited, to 
give with that average its estimated standard error, so that the reader 
will know about how significant that average is and not be led into 
using it to make comparisons or to draw conclusions that are not justi- 
fied by the number of observations which are summed up in that 
average. One way of doing this is to write the average followed by 
the statement "plus or minus the standard error." Thus, in the case
we have been considering, with the single sample showing an average
of 30.0 bushels with a standard error of 0.87 bushel, and with only 20 






cases in the sample, the correct statement is to say "the average
yield has been shown by the sample to be 30.0 ± 0.87 bushels (20
cases)."⁹ If a similar sample from a different area has shown the
average yield to be 28 ± 2.0 bushels (20 cases), the reader would
know that there was a fair chance that the true average yield was 
really the same in both areas, in spite of the difference shown by the 
two averages. 

The greatest value of the standard error does not lie in merely 
indicating how near the sample value may come to the true value, for 
two samples out of three, on the average of a number of such samples. 
In exactly the same way that we have seen that two-thirds of the 
averages from the samples usually fall within one standard deviation on 
either side of the true mean, mathematicians have determined for 
large samples that 19 out of 20 (95.45 per cent) of the samples will 
give averages which fall within two standard deviations of the mean, 
369 out of 370 (99.73 per cent) will usually fall within three standard 
deviations of the mean, and all but one case out of 16,667 samples 
(99.994 per cent) will usually fall within four standard deviations of 
the mean. 
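For large samples these proportions are those of the normal curve, and they can be computed with the error function from Python's standard library. This is a sketch under that assumption; the name `share_within` is illustrative.

```python
import math


def share_within(k):
    """Proportion of a normal distribution lying within k standard
    deviations on either side of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3, 4):
    p = share_within(k)
    print(k, round(100 * p, 3), "per cent; about one sample in",
          round(1 / (1 - p)), "falls outside")
```

The odds printed are unrounded, so they may differ a little from the round figures quoted in the text.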

When there are less than 30 observations in the sample, the 
tendency of the computed standard error to be misleading is even 
greater for high odds than it is for lower odds. Corrections to take 
this into account are also shown in Table A. Thus, with samples of 20 
cases, 6 samples out of 100 will give averages differing from the true 
average by more than twice the computed standard deviation, and 7 
samples out of 1,000 will miss the true average by more than three 
standard deviations. This last is three times the proportion of such 
failures which would occur in the long run with samples of over 
30 observations. With very small samples, the failures for high 
odds occur even more frequently. Thus, for samples with only 4 
observations, 14 samples out of 100 will differ from the true mean 

⁹ The most general practice is to write after the average ± 0.6745 times the standard
error (0.59 bushel in this case, so the statement would read 30.0 ± 0.59 bushels).
This value, $0.6745\,\sigma_M$, is called the probable error of the mean, since it gives the
range within which the chances are even that the true mean lies, when there are
more than 30 observations, and also the range without which the chances are even
that the true mean lies. Since this tends to make the average appear rather more
accurate than does the standard error, the practice suggested of using the standard
error instead has been recommended by many competent statisticians. Wherever
that is done, however, it would be well to insert a footnote explaining that it is the
standard error, and not the probable error, which is being shown after the sign "±."




by twice the computed standard error, and about 6 out of 100 will 
differ by three times the standard error, on the average. 

Where high reliability is desired, and only small samples are avail- 
able, it is very important to take into account the corrections shown 
in Table A. 

Interpreting the standard error in the illustrative problem. Ignor- 
ing for the time the lack of complete accuracy in our estimate of the 
standard error itself (page 24), we can interpret the statement that 
the average yield in the area studied was 30 ± 0.87 bushels in any
of the following ways:

a. If we state that the true mean lies within one standard error 
of the observed mean (between 29.13 and 30.87 bushels, in this case) 
each time we use a sample of this size, we shall be wrong in our state- 
ment one time out of three, on the average. 

b. If we state that the true mean lies within two standard errors
of the observed mean (between 28.26 and 31.74 bushels) each time we 
use a sample of this size, we shall be wrong in our statement one time 
out of 17, on the average. 

c. If we state that the true mean lies within three standard errors 
of the observed mean (between 27.39 and 32.61 bushels) each time 
we use a sample of this size, we shall be wrong in our statement one 
time out of 135, on the average. 

d. If we state that the true mean lies within four standard errors 
of the observed mean (between 26.52 and 33.48 bushels) each time we 
use a sample of this size, we shall be wrong in our statement only one 
time out of 1,250, on the average. 
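The four statements above can be tabulated in a few lines. In this sketch the proportions are copied from the column of Table A for samples of 20, and the names `table_a_n20` and `interval` are assumptions made for illustration.

```python
mean, se = 30.0, 0.87   # bushels, from the sample of 20 farms

# Proportion of samples of 20 in which the error exceeds 1, 2, 3, and 4
# standard errors, copied from Table A.
table_a_n20 = {1: 0.3298, 2: 0.0600, 3: 0.0074, 4: 0.0008}


def interval(mean, se, k):
    """Limits lying k standard errors below and above the observed mean."""
    return round(mean - k * se, 2), round(mean + k * se, 2)


for k, p in table_a_n20.items():
    low, high = interval(mean, se, k)
    print(f"{k} standard error(s): {low} to {high} bushels, "
          f"wrong about 1 time in {1 / p:.0f}")
```

Taking the reciprocal of each tabled proportion gives the "one time out of" figures used in statements a to d.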

Comparing these conclusions with the 16 samples shown in Tables 
7 and 8, we see that 2 of those samples did fall outside the limits 
given by twice the estimated standard error. If we had been so 
unlucky as to have got the worst one of these as our single sample, 
instead of the one which we actually did get, then we should not have 
hit the average even if we had used a range of twice the computed 
standard deviation as that within which we expected the true average 
to fall. On the other hand, every one of the averages fell within the 
range covered by three times the standard deviation. Even if, in 
picking our single sample, we had been unfortunate enough to draw 
the poorest one of the lot (the one which gave an average yield of
27.5 bushels) and had used a range of three times the standard error,
we should have been correct in our statement as to the range within 

¹⁰ Figure A, page 505, which gives in more detailed form the corrections shown
in Table A, may be used to work out these odds.





which we expected the true average to lie. Then we should have con- 
cluded that the true mean fell somewhere between 24.3 and 30.7 bushels, 
which would have been wide enough to include the real mean. Of 
course, if we had taken four times the standard error, we should have 
been almost absolutely certain of including the true mean in the 
stated range, with only one chance in over 1,000 of being wrong. 

In most statistical work, three times the standard error is taken 
as the greatest extent to which a given observed constant is likely to 
miss the true value for the universe. Even though there is about one 
chance in 370 of being further off than this with samples of 30 or more, 
most scientists are willing to take the chance that their sample is not 
that one exceptional case. For exceedingly important work, or where
absolute accuracy of comparison is essential, even four times the 
standard error might be used; but for the general run of statistical 
problems, and with fair-sized samples, it would seem safe to regard 
three times the standard error as about the largest extent to which 
the conclusions might be out solely because of the chances of getting 
an unusual sample in random sampling. 

In view of the possibility of the standard error itself being in error, 
however, the number of observations should always be stated, as well as 
the standard error of the constant, particularly where the sample is 
small. 

Bias in sampling. The figure as to standard error tells nothing at 
all of how much error there may be because of bias in sampling. 
Thus, if in taking our sample of 20 farms, we had visited only the 
largest farms with the most prosperous-looking buildings, we should 
be very likely to get a sample which was not representative of all the 
farmers in the county, but simply of the better ones, and so might get 
an average yield, say 10 bushels, above the true average for the 
county. Even if we only selected our farmers to the extent of including
those which were most willing to give us the figures we wanted,
we might have a badly biased sample, as usually the best farmers 
and the most intelligent ones are most willing to answer such questions. 
We must depend largely on common sense and on other knowledge 
of the situation we are studying, and not on statistical computations, 
to tell us whether or not our sample is really representative of the 
universe we want to study. Thus we might compare the average 
size or value of the farms in our sample with the averages for all the 
farms in the county, as shown by the census reports, to see whether 
they were representative or not. All that the computed standard





error can tell us is about how closely it is likely to approach the aver- 
age (or other characteristic) of the group it does actually represent — 
whether that group is the one we meant it to represent or only a part 
of that group. This caution must always be kept in mind in using 
samples: Computed standard errors tell us how far our results may be 
off solely because of the chance of getting a poor sample with a limited 
number of cases; but they do not tell us how far we may be off because 
of a biased sample, which is not a fair selection from the universe we 
wish to study. 

Deciding on the size of sample necessary to obtain a stated
reliability. One other application of the standard-error formula re- 
mains to be mentioned. The way in which this formula can be used 
to estimate the reliability of the average from a given sample, when 
the number of cases is known, has already been explained. The same 
formula can be used to determine how large a sample would have 
to be taken in order to secure results within any reasonable assigned 
limits of accuracy. 

Thus it has already been shown that the records from 20 farms 
could be used to say that the true average yield lay somewhere 
between 27.39 and 32.61 bushels, with about one chance in 135 of 
that statement’s being wrong. How many farms would one have to 
visit to state the same average yield to within one bushel, with the same
chance of the statement’s being wrong? The same formula which 
was used to determine the standard error of the average can be 
turned around to answer this question also. 

If we know that we want to get an average reliable to within one 
bushel, for a range of three times its standard error, then we know 
that the standard error of that average would have to be only one-third
of a bushel. We may also assume that when we take our larger
sample, the standard deviation of the yields on the individual farms
will be found to be not very different from what it was in our sample
of 20 cases, and so use the same standard deviation as we did before.

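Taking the relation which was used in computing the standard
error before, we have:

$$\text{Standard error of the average} = \frac{\text{estimated standard deviation of items in the universe}}{\sqrt{\text{number of cases in the sample}}}$$

In the new case we have the required standard error given, 1/3
bushel; we are assuming that the estimated standard deviation for the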
universe from our larger sample will be 3.89 bushels, just as it was from 





our sample of 20 cases. Substituting these values in our equation, and 
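using $n'$ to represent the number of cases required in the new sample,
we then have

$$\tfrac{1}{3}\ \text{bushel} = \frac{3.89\ \text{bushels}}{\sqrt{n'}}$$

When the terms are shifted around, this becomes

$$\sqrt{n'} = \frac{3.89\ \text{bushels}}{\tfrac{1}{3}\ \text{bushel}} = 11.67$$

Hence

$$n' = 136.2$$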


We therefore conclude that if a sample of 136 reports were ob- 
tained, we should probably get an average yield which would not differ 
from the true average yield for all the farms by more than one bushel 
in more than one such sample out of several hundreds of such samples. 
If any other limit of error was set, we could similarly determine how 
many reports would probably be necessary to satisfy that limit. 
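Turning the formula around in this way amounts to squaring the ratio of the standard deviation to the required standard error. The sketch below assumes, as the text does, that the larger sample would show the same estimated standard deviation of 3.89 bushels; the function name `required_sample_size` is illustrative.

```python
import math


def required_sample_size(universe_sd, target_se):
    """Solve target_se = universe_sd / sqrt(n) for n."""
    if target_se <= 0:
        raise ValueError("the required standard error must be positive")
    return (universe_sd / target_se) ** 2


n_needed = required_sample_size(3.89, 1 / 3)
print(round(math.sqrt(n_needed), 2), round(n_needed, 1))
```

The text takes the result as 136 reports; rounding up with `math.ceil` would be the slightly more cautious choice when the figure is used for planning.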

In these computations we have ignored the standard error of the 
standard error. If we took into account the possibility that the true 
standard error might be larger than our computed standard error, 
we should need a still larger sample to be sure of the accuracy 
specified. 

Standard errors for other measures. This whole discussion has 
been in terms of determining how closely it was possible to approximate
the true average from the average shown by a sample. In
exactly the same way standard-error formulas have been worked out
indicating how closely it is possible to approximate the true values
of other statistical measures (such as standard deviations, for example)
from the values for those measures determined from a sample.¹¹
These are interpreted in much the same way as are the standard 
errors of averages; they will be referred to in subsequent chapters. 


Universes, Past and Present 


Any statistical measurement relates to something that is already 
past by the time the measurement can be analyzed. Thus our records 


¹¹ The standard error of a standard deviation ($\sigma_\sigma$) may be approximately
determined by the formula

$$\sigma_\sigma = \frac{\bar{\sigma}}{\sqrt{2(n-1)}}$$





of the yield of corn obtained must relate to some crop that has already 
been harvested. Yields for a crop still growing could only be fore- 
casts, and could never be precisely accurate until the crop was har- 
vested and was weighed or measured. Yet human beings cannot live 
in the past. Our measurements of past events can be of meaning to 
us only when we project them into the future, and use them as a guide 
to future conduct. In studying the yield of corn, for example, the 
actual realized yield of corn in a county in a given year, no matter 
how accurately measured, is already a matter of history. The only 
thing that can be significant in human affairs is the average yield in 
some future year, still to be produced. If we are planning an A.A.A. 
control program, for example, and wish to estimate how many acres 
will produce a given total bushelage, we shall always be dealing with 
future years. We can do nothing to change the past. Only the 
future can be affected by our actions. When we take the average 
yield for a past year as our "universe" to be studied, what we are really
interested in knowing is usually something about the yield most likely 
to be secured in one or in a series of years in the future. Even 
if we took a census of the yield on all the farms in the county, we 
should not have all the facts about our true universe. That universe, 
whose values we really wish to estimate, is composed of the yields 
next year and in other years still to come. Measurements of conditions
in the past, no matter how accurately made, can serve only as
one part of the basis for judging what the values in the future are 
likely to be. Analysis of what has happened in a succession of years
in the past may help us to make a better estimate of the future. Such 
analysis may show a steady upward trend, or a variation from year 
to year with rainfall, or other variations whose cause we do not know. 
But before we can project the past trends into the future, we must
understand what caused them, and judge whether those causes will 
continue to operate. These judgments are not a matter of statistical 
analysis as such but must be based upon scientific and technological 
study of all the forces at work. Thus a steady upward trend in 
cotton yields might reflect a rising price of cotton in the period studied, 
and a resulting increase in the quantities of fertilizer applied per acre. 
But equally well it might reflect a steady decrease in the total acreage
(due to crop control or other causes) and a concentration of the remaining
acreage on the better lands. Or it might reflect the gradual
adoption of improved strains. A forecast of whether the upward 
trend would continue into the future would be materially different 
in the three cases. Besides the statistical facts, it would involve 





non-statistical judgments as to whether the increase in price or the 
limitation of acreage or the improvement in seed was likely to continue. 

Whether we are dealing with the statistical characteristics of 
people or of crops or of prices or of atoms, the real universe for which 
we wish to estimate is the universe of future events. Our ability to 
forecast those events will differ widely from field to field. Presumably 
the characteristics of atoms or of chemical compounds will be less 
subject to change than will those of crops, and crops will be less sub- 
ject to unpredictable change than will prices. In each case, however, 
the statistical information gained from the study of past samples must 
be tempered by other knowledge of the situation, based on study and 
analysis which may be quite non-statistical in nature. When we move 
from the facts of the past to forecast the unknown universe of the 
future it is not the statistics but the statistician who is on trial. 
Unless he mixes an ample measure of anthropology or agronomy or
economics or other appropriate scientific information with his statistics,
plus a liberal dash of common sense, he may find his analysis of
past events a detriment, rather than an aid, in judging as to the future.

Summary. This chapter considers the question of how far statis- 
tical results derived from a selected “sample” drawn from a universe 
can be used to reach general conclusions as to the facts of the entire 
universe. 

The confidence which can be placed in any measure computed from 
a sample, say an average, depends upon how closely that average 
is likely to come to the true average of the whole universe. One 
way of determining that would be to collect additional samples, each 
of the same size. From the way the averages from each of these differ- 
ent samples varied one could judge how near the average from 
any one sample was likely to come to the true average. For samples 
which meet the conditions of simple sampling, another much more 
rapid way is to compute the standard error of the average, which
indicates the minimum extent to which the average is likely to be 
correct. With samples of over 30 cases, the true average will prob- 
ably be within twice the standard error from the observed average 
for 19 samples out of 20, and within three times the standard error 
369 times out of 370. This is the minimum error; where the number 
of observations is smaller, the possibility of error is larger, as is indi- 
cated by Tables A and B. 

The same formula can be used to estimate how large a sample 
must be taken to secure any desired degree of accuracy in the final 
average. 
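
The arithmetic behind these statements can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the function names and the sample of yields are hypothetical, the standard deviation is computed with the n - 1 divisor, and the chances quoted are those of the normal curve, which apply to large samples.

```python
import math
import statistics


def standard_error(sample):
    """Standard error of the average: standard deviation / sqrt(n).

    statistics.stdev divides by n - 1; that choice is an assumption here.
    """
    return statistics.stdev(sample) / math.sqrt(len(sample))


def normal_coverage(k):
    """Chance that a normally distributed average lies within k standard errors."""
    return math.erf(k / math.sqrt(2))


def sample_size_for(standard_deviation, target_standard_error):
    """Smallest sample size whose standard error does not exceed the target."""
    return math.ceil((standard_deviation / target_standard_error) ** 2)


# Hypothetical yields in bushels per acre, invented only for illustration.
yields = [31.0, 28.5, 35.2, 30.1, 29.8, 33.4, 27.9, 32.6]
print(f"average = {statistics.mean(yields):.2f}")
print(f"standard error = {standard_error(yields):.2f}")

# Chances for large samples, from the normal curve.
print(f"within 2 standard errors: {normal_coverage(2):.4f}")
print(f"within 3 standard errors: {normal_coverage(3):.4f}")

# Cases needed to bring the standard error down to 2 when the
# standard deviation of individual cases is taken to be 10.
print(sample_size_for(10, 2))
```

Under the normal curve the chance of falling within twice the standard error is about 0.9545, or roughly 19 out of 20, and within three times the standard error about 0.9973, or 369 out of 370. The last line illustrates the use of the same formula to fix the size of sample in advance: a standard deviation of 10 and a desired standard error of 2 call for 25 cases.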





The estimated standard error does not take into account bias in 
selecting the sample, but only shows the chances of reaching incor- 
rect results even when an honest random sample is obtained. 

Even after the values in the universe have been estimated from the 
facts shown by the sample, the statistician must still remember that 
that universe is a past universe. In applying that knowledge to prob- 
lems of future action, he must give due allowance to the fact that the 
yet unborn universe of the future may never be identical with the 
past and dead universe from which his sample was obtained. 



CHAPTER 3 


THE RELATION BETWEEN TWO VARIABLES, AND THE 
IDEA OF FUNCTION 

Relations are the fundamental stuff out of which all science is built.
To say that a given piece of metal weighs so many pounds is to state 
a relationship. The weight simply means that there is a certain rela- 
tionship between the pull of gravity on that piece of metal and the 
pull on another piece which has been named the “pound.” We can tell
what our “pound” is only by defining it in terms of still other units, 
or by comparing it to a master lump of metal carefully sheltered in 
the Bureau of Standards. If the pull is twice as great on the given 
piece of metal as it is on the standard pound, then we say that the lump 
weighs 2 pounds. If, further, we say it weighs 2 pounds per cubic inch, 
that is stating a composite relationship, involving at the same time 
the arbitrary units which we use to measure extent or distance in space 
and the units for measuring the gravitational force or attracting power 
of the earth. 

Relations between variables. Besides these very simple relation- 
ships which are implicit in all our statements of numerical description 
(weight, length, temperature, size, age, and so on), there are more
complicated relationships where two or more variables are concerned. 
A variable is any numerical value which can assume varying or differ- 
ent values in successive individual cases. The yield of corn on dif- 
ferent farms is a variable, since it may differ widely from farm to 
farm. So is the length of time which a falling body takes to reach 
the earth, or the quantity of sugar that can be dissolved in a glass 
of water, or the distance it takes for an automobile to stop after the 
brakes are applied, or the quantity of milk that one cow will produce 
in a year, or the profit that a farm will pay in a year, or the length of
time it takes a person to memorize a quotation. In contrast to these
variables there are other numerical values called constants, because 
they never change. Thus one foot always contains 12 inches; one 
dollar always is equal to 100 cents; and a stone always falls 16 feet 
in the first second (under certain specified conditions). Science, of 
any sort, ultimately deals with the relation between variable factors
and with the determination, where possible, of the constants which
describe exactly what those relationships are. 

The variables which have been mentioned may be used to illus- 
trate the way in which changes in one variable can be related to 
changes in another. Thus the length of time which a falling body 
takes to reach the earth varies with (that is, is related to) the distance
through which the body has to fall. The quantity of sugar which 
can be dissolved in a glass of water varies both with the size of the 
glass and the temperature of the water. The distance it takes for an 
automobile to stop after the brakes are applied varies with the speed 
with which the car is traveling when the brakes are applied, the area 
of braking surface on the drums, the area of tire surface on the road, 
how tightly the brakes are applied, how much the car weighs, the 
kind of road, and so on. 

Then when we come to variables like the production of milk or 
the income on a given farm, or the time to memorize a quotation, 
we find the situation still more complicated. How much milk a cow 
will produce varies with her age, breed, inherent ability, and the 
richness of the milk, and with the kind, quality, amount, and com- 
position of the feed she receives, the way she is stabled and cared 
for, and many other similar factors. Similarly the variables which 
may affect the income on a farm (the size, the equipment, the crops
grown, the livestock kept, the methods followed, the costs paid, the
prices received, the rainfall) are so numerous that it would take an
entire book merely to list and discuss the different factors affecting 
this one single variable. The time it takes to memorize a quotation 
may be affected by its length, the subject's age, sex, training, fatigue 
or freshness, his familiarity with the material discussed, and his interest 
in the topic. 

Yet it is precisely with relations between complex variables that 
many statistical studies must deal. The statistical methods which may 
be used to handle such problems can best be understood if presented 
first for the simplest cases, and then expanded to cover the more com- 
plicated ones. 

Suppose a physicist, knowing nothing about the exact nature of 
the relation between the distance a body has to fall and the length of 
time it takes, made some experiments to determine the matter and 
obtained the results shown in Table 9. 

Looking over these figures we see that there is some sort of general 
relation between the two. As the distance increases, the time increases 
also. But that is not uniformly true. In one case the distance increased
without there being any increase in the recorded time; in some
other cases the recorded time was not the same even though the dis- 
tance was unchanged. 

TABLE 9

Relation between Distance a Marble Drops and Time It Takes to Fall

Distance traveled   Time elapsed   Distance traveled   Time elapsed
     (feet)           (seconds)         (feet)           (seconds)

        5                0.6              20                1.1
        5                0.5              20                1.1
        5                0.6              20                1.2
       10                0.9              20                1.1
       10                0.8              25                1.2
       10                0.7              25                1.3
       15                1.0              25                1.2
       15                0.9              25                1.3
       15                1.0




Graphic representation of relation between two variables. We
can get a better idea of just exactly what the relation is if we “plot”
it on cross-section paper, so that we can see graphically just how
the time does change with the distance. Figure 2 illustrates the way
this is usually done. The units of one variable, in this case the distance
to be traversed, are measured off from the left, starting with zero in
the lower left-hand corner and counting over toward the right. The
units of the other variable, in this case the time elapsed, are measured
off from the bottom, starting with zero and counting up toward the
top. If negative values are present, then the counting is started with
the largest negative value, decreasing from left to right or from bottom
to top, until zero is reached and the positive values begin to appear.

[Figure 2: dot chart. Horizontal axis: distance fallen, in feet (0 to 30). Vertical axis: time elapsed, in seconds (0 to 1.4). Sample dots are labeled, including the dot for 5 feet and 0.6 second.]

Fig. 2. Method of constructing a dot chart. Time elapsed is the dependent
variable, and the distance is the independent variable.

Where one variable may be regarded as the cause and the other 
variable as the result, it is customary to put the causal variable along 
the bottom. In this case it may be said that the differences in distance 
traversed cause the differences in time elapsed. Distance, therefore, 
is measured in the horizontal direction, and time in the vertical. There 
is no particular reason for plotting data just this way except that this 
is the customary way of doing it and so it is most readily understood 
by other persons. Some relations of this sort can be reversed, so 
that either may be regarded as cause and either as effect.¹

Having laid off the chart in the way indicated, we next “plot” the
individual observations. The way this is done is illustrated in Figure 2. 
The first observation was that it took 0.6 second for the marble to 
fall 5 feet. This is indicated on the chart by counting over to the 
5-foot line from the left of the chart, and then counting up along that 
line until 0.6 second is reached. A dot is placed on the chart at that 
point. As indicated, this dot is at the intersection of the line starting 
from the “0.6 second” at the left of the chart and extending parallel to
the “0-second” line, with the other line starting from “5 feet” at the
bottom of the chart and extending parallel to the “0-foot” line. Simi- 
larly, the last observation, 25 feet in 1.3 seconds, is indicated by a dot 
where the horizontal line representing 1.3 seconds crosses the vertical 
line representing 25 feet. 

Entering a dot for each individual observation in the same way, 
we get the chart shown in Figure 3. This figure now gives a visual 
representation of the way in which the length of time changes as the 
distance traversed changes. Such a chart is known as a “dot chart” 
or a “scatter diagram.” 

But even this figure does not show the exact relation between the 
distance and the time. Both the first and the second trials were for 
exactly the same distance, yet the time was slightly different. Obviously
that difference in time could not have been due to the difference
in distance between the two, because there was no difference. The
investigator must therefore assume that some outside cause, perhaps
the accuracy with which the time was measured, may have been 

¹ For a more extended discussion of this point, see pp. 50 and 51.




responsible for these slight differences. It will be noted, too, that when 
the different observations are plotted as in Figure 3, they come 
close to all lying along a continuous curve. We also see that the 
individual cases do not adhere absolutely to a continuous curve. If 
we are willing to assume that all the differences between the different 
observations at the same point along the curve are due solely to 
extraneous factors, we can estimate the true effect of the distance, 
by itself, by averaging together the several observations as to time 
taken for each of the several tests for the same length of fall. A 

[Figure 3: scatter diagram. Vertical axis: time elapsed. Horizontal axis: distance fallen, in feet.]

Fig. 3. Relation of distance a marble falls to time elapsed in falling, as shown by
individual observations and curve of average time.

continuous curve drawn through these averages would then indicate 
the way in which the duration of fall varied with the distance, on the 
average of the cases studied. Although it might not hold true for 
any one individual case, as we have just seen, still it does indicate 
about what the time will be. For practical purposes we may say that 
under given conditions the time a body takes to fall is determined 
by the distance which it has to fall. 

The average time for each distance is indicated by the small circles 
in Figure 3. It is evident that all these averages lie very close to the 
smooth freehand curve which has been drawn on the chart. 
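
The averaging just described can be sketched in Python as follows. The observations are those of Table 9; the function and variable names are assumptions made for this illustration and are not part of the original method.

```python
from collections import defaultdict

# Observations from Table 9: (distance in feet, time in seconds).
TABLE_9 = [
    (5, 0.6), (5, 0.5), (5, 0.6),
    (10, 0.9), (10, 0.8), (10, 0.7),
    (15, 1.0), (15, 0.9), (15, 1.0),
    (20, 1.1), (20, 1.1), (20, 1.2), (20, 1.1),
    (25, 1.2), (25, 1.3), (25, 1.2), (25, 1.3),
]


def average_time_by_distance(observations):
    """Group the recorded times by distance and average each group."""
    groups = defaultdict(list)
    for distance, seconds in observations:
        groups[distance].append(seconds)
    return {d: sum(t) / len(t) for d, t in sorted(groups.items())}


averages = average_time_by_distance(TABLE_9)
for distance, mean_time in averages.items():
    print(f"{distance:>2} feet: {mean_time:.3f} seconds")
```

Each printed average corresponds to one of the small circles through which the freehand curve is drawn.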





Expressing a functional relation mathematically. The relation 
shown by the curve in Figure 3 is what mathematicians call a functional
relationship; the time it takes a body to fall is a function of the
distance which it has to traverse.² All that this means is that for any
particular distance-fallen, there is some corresponding time-required. 
The term “function” means that there is some definite relation be- 
tween the two variables, number of feet and number of seconds, but 
it does not at all tell just what that relationship is. When, however, 
it is said that time is a function of distance according to the curve 
shown in the figure, then the statement has been made perfectly defi- 
nite. The curve shows, for any given distance, exactly how long it 
will take a body to fall, on the average of a series of trials. 

In this particular case the function is defined only by the graphic 
curve. It may also be stated as a mathematical expression 

$$Y = \tfrac{1}{4}\sqrt{X}$$

using X for distance in feet and Y for time in seconds. This equation 
corresponds to the curve in a peculiar way, in that if any value of X is 
substituted in it, and then the value of Y determined, that will be the 
value of Y (time in seconds) corresponding to that particular value
of X (distance in feet) as shown by the curve in Figure 3. This
equation is therefore the equation of the function, since this simple
mathematical expression tells just as much about the relation between
the two varying quantities (time and distance) as does the entire
curve in the figure. 

The way this equation is used may be illustrated by two examples. 
Suppose a marble falls 16 feet; how long should it take to fall? The 
value of X would then be 16; substituting this value in the equation, 
we have 

$$
\begin{aligned}
Y &= \tfrac{1}{4}\sqrt{16} \\
Y &= \tfrac{1}{4}(4) \\
Y &= 1
\end{aligned}
$$

This gives a value of 1 for Y, which means that it would take 1
second to fall. Suppose again a bomb were dropped from an airplane 

² Using Y for time and X for distance, we state this mathematically:

$$Y = f(X)$$





10,000 feet high. How long would it take to reach earth? The value 
of X is then 10,000; substituting this value in the equation, we have 

$$
\begin{aligned}
Y &= \tfrac{1}{4}\sqrt{10{,}000} \\
Y &= \tfrac{1}{4}(100) \\
Y &= 25
\end{aligned}
$$

The result Y = 25 means that it would take 25 seconds for the
bomb to fall.³
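
The two substitutions may be checked with a short Python sketch of the equation, Y equal to one-fourth the square root of X. The function name is an assumption made for this illustration, and, like the equation itself, the sketch takes no account of the resistance of the air.

```python
import math


def fall_time_seconds(distance_feet):
    """Time in seconds to fall a given distance in feet: Y = (1/4) * sqrt(X)."""
    if distance_feet < 0:
        raise ValueError("distance must be non-negative")
    return 0.25 * math.sqrt(distance_feet)


print(fall_time_seconds(16))      # the marble falling 16 feet: 1.0
print(fall_time_seconds(10_000))  # the bomb falling 10,000 feet: 25.0
```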

It is evident that the equation goes much further than does the 
graph of the curve. The latter gives the relation between distance 
and time only for the distances which are shown on the chart. The 
equation, on the other hand, gives the relation for any distance 
whatever, no matter what it may be. It is possible to state this 
law of gravity, as it is called, in an equation only because physicists
have studied this relation in the past and determined exactly how
the one quantity varies with the other. Having found that the same
relation between the two variables held through their entire range of
observation and having worked out on philosophical grounds a good
reason why that relation should hold, they have felt safe in coming
to the conclusion that it will continue to hold even beyond the range
of the experimental verification.⁴ Where only a graph of the function
is available, on the contrary, only the relation within the stated range
is known. The graph does not tell, of and by itself, the direction the
curve would take if extended beyond the limits determined by the ex- 
periments. 

Now if instead of the relation we have just been discussing we 
consider the relation between the quantity of sugar which can be 
dissolved in a glassful of water and the temperature of the water, we 

³ Outside causes, such as friction with the air, may make the time of fall slightly
different from the calculated time; therefore with so long a fall as this the time
might differ quite perceptibly from the theoretical time given by the equation.
This equation gives the time required when no influence other than gravity is taken
into account. Obviously a marble would fall in air much faster than a feather;
the resistance of the air has very little influence on the speed of the marble and a
great deal of influence on the speed of the feather. In a vacuum they would fall
at the same rate.

⁴ It should be noted that for very great distances (say 10,000 miles) the formula
might need to be modified, since then the pull of the earth would be less than it is
at the surface. The equation holds true only for those distances from the earth
within which its pull is practically a constant.




have quite a different problem, and yet one that is similar in many 
aspects. If we start to determine it experimentally, we must first 
make sure that the quantity of water with which we are working is 
the same in every trial; then we must measure accurately both the 
temperature of the water and the amount of sugar which could be 
dissolved in it. Water expands when it is heated, and it also has a 
tendency to evaporate; so we would have to decide whether we wanted
the same volume of water, irrespective of the fact that at a higher 
temperature there would be actually less water in that volume, or 
whether we wanted the volume of water equivalent to what would 
be the same volume at a given fixed temperature. (This would 
necessitate determining the relation between volume and temperature 
for a given weight of water as a preliminary study, or else using 
weight instead of volume as our criterion.) Many other similar factors 
which might possibly influence the result would have to be considered 
before even the exact plan of the experiment could be drawn up. 

Once the experiment had been run the numerical results would 
probably be somewhat similar in character to those in the gravity 
test. It would be found that about the same quantity of sugar was 
dissolved in a given quantity of water when repeated tests were made 
at the same temperature, but that the quantities varied slightly from 
each other. If the data were plotted on a scatter diagram like 
Figure 3, it would be found that the data fell in the general shape of 
a curve, but that very few of the dots fell exactly on the curve, 
some lying above and some below the continuous line which could be 
drawn about through the center of them. Again we might conclude 
that these slight differences from exact agreement were due to factors 
other than the temperature of the water (to slight experimental errors
in the quantity or temperature of the water, or to slight errors of
measurement in determining the quantity of sugar) and be willing to
conclude that the line drawn through the center of the series of observa- 
tions showed the real effect of differences in temperature on the quan- 
tity of sugar dissolved, when extraneous influences were removed. This 
again would be a functional relation. The curve would express the
relation between changes in temperature and changes in quantity of 
sugar, showing for any given temperature exactly how much sugar 
could be dissolved. It might then be possible to determine a type of 
equation which would accurately specify the function by a mathematical
formula, similar to that discussed for the gravity example, if
the logical type of relation between the two variables could be worked
out.⁵

Determining a functional relation statistically. In the two cases 
which have been discussed the relation between the two variables was 
sufficiently close so that by taking proper experimental precautions 
other influences which might affect the result could be largely removed 
and a series of observations obtained sufficiently consistent with each 
other so that the exact nature of the relation could be readily deter- 
mined. In many other types of relations this cannot be done so easily. 
It is with this type of relation that statistical methods really become 
important. 

If we were making a traffic study in a given city, for example, we 
might wish to know what would be the safe speed limits to permit 
on different streets. In that connection we might need to know in 
what distance an automobile could be stopped when traveling at dif- 
ferent speeds, so that by comparing this distance with the width of the 
different streets and the length of view at intersections we could judge 
how fast machines might be able to travel without risk of collisions 
at street intersections. One way to determine what is the relation
between speed and stopping distance would be to make a number of 
tests in different portions of the city, taking different types of machines 
and different drivers. Let us suppose that as the result of such a series 
of tests we obtained the series of observations shown in Table 10. 

⁵ Some logical foundation is needed before a mathematical equation to a curve
can be of any more value than merely the chart which graphs the curve. Thus in
the gravity example it is evident that the farther a body falls, the faster it falls; in
every successive instant the speed it has already attained is increased by the effect
of the continued pull which is added to it. Purely mathematical investigations of
the relation between such constantly growing magnitudes and the variable with 
which they grow have enabled physicists to determine the general mathematical 
type to which the relation must conform. Then, knowing what the type of the 
curve is, we find it to be relatively easy to determine the constants (such as the
$\tfrac{1}{4}$ of the equation $Y = \tfrac{1}{4}\sqrt{X}$) which make the general equation applicable to
a given specific case. This is done by using experimental results, such as those
given in Table 9, to calculate the constants for the specific type of curve which has 
been determined upon. 

Not all functional relations can be subjected to this type of logical analysis, 
however, and it is sometimes impossible to tell what sort of equation the results 
should really follow. In that case any mathematical curve “fitted” to the data has 
no more special meaning than the graphic curve drawn through the center of the
observations; both are merely empirical descriptions of the relations, and both are 
limited in their interpretation to the range of the particular data upon which they 
are based. This fact will be discussed more fully later on. 
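
As a rough illustration of calculating a constant from experimental results, the sketch below fits the constant c in a curve of the type Y = c times the square root of X to the observations of Table 9. The least-squares criterion used here is an assumption chosen for the illustration, not a method stated in this chapter, and the function name is hypothetical.

```python
import math

# Observations from Table 9: (distance in feet, time in seconds).
TABLE_9 = [
    (5, 0.6), (5, 0.5), (5, 0.6),
    (10, 0.9), (10, 0.8), (10, 0.7),
    (15, 1.0), (15, 0.9), (15, 1.0),
    (20, 1.1), (20, 1.1), (20, 1.2), (20, 1.1),
    (25, 1.2), (25, 1.3), (25, 1.2), (25, 1.3),
]


def fit_sqrt_constant(observations):
    """Least-squares value of c in Y = c * sqrt(X).

    Minimizing the sum of (y - c * sqrt(x))**2 gives
    c = sum(y * sqrt(x)) / sum(x).
    """
    numerator = sum(y * math.sqrt(x) for x, y in observations)
    denominator = sum(x for x, _ in observations)
    return numerator / denominator


c = fit_sqrt_constant(TABLE_9)
print(f"fitted constant: {c:.4f}")
```

The fitted value comes out very close to one-fourth, the constant of the equation for falling bodies. The type of the curve is taken as given by the logical analysis; the data supply only the constant.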



It is apparent from the table that there are great variations in the 
distances which different cars or different drivers required to stop, 
even when traveling at the same speed. This is shown even more 
clearly when we make a dot chart of the data in just the same way 
as illustrated in Figure 3. The graphic comparison between speed
and distance-to-stop, shown in Figure 4, reveals that there is only a
general agreement between the different tests. There is certainly some
relation between the two variables, but it is vague and uncertain in
comparison with the relatively sharp and clear-cut relations shown in
Figure 3.

TABLE 10

Relation between Speed of Automobile and Distance to Stop after Signal,
as Shown by 50 Individual Observations

Speed when        Distance traveled     Speed when        Distance traveled
signal is given   after signal          signal is given   after signal
                  before stopping*                        before stopping*
(miles per hour)  (feet)                (miles per hour)  (feet)

       4                  2                   19                 46
       7                  4                   24                 93
      17                 50                   14                 26
      14                 36                   12                 28
      12                 20                    9                 10
      11                 28                   10                 34
      20                 48                   15                 20
      15                 54                   24                 70
      17                 40                   25                 85
      13                 34                   20                 64
      15                 26                   19                 36
      19                 68                   13                 26
      10                 26                   10                 18
      18                 56                    7                 22
      22                 66                   16                 40
      18                 84                   14                 60
       8                 16                   20                 52
       4                 10                   24                120
      12                 14                   24                 92
      20                 56                   17                 32
      23                 54                   13                 34
      18                 76                   11                 17
      12                 24                   13                 46
      16                 32                   14                 80
      18                 42                   20                 32

* These observations were made before 4-wheel brakes were common.



Fig. 4. Relation of speed of automobile to distance it takes to stop, as shown by
individual observations.

There is no particular difficulty in understanding why the relation 
is not more definite. The data represent a great variety of different 
elements: cars with two-wheel brakes and cars with four-wheel brakes;
cars with brakes in adjustment and cars with brakes well worn; cars
nearly empty and cars heavily loaded; cars with balloon tires and 
cars with high-pressure tires. In addition, the drivers differ. Some 
are experienced drivers, some inexperienced; some strong and some 
unable to press the brakes fully down; some with almost instantaneous 
reaction to our signal to stop, some with faltering or lagging response; 
some bright and wide awake, others tired and unobservant; some calm 
and steady, others nervous and erratic. Finally the conditions of the 
tests might be different — some on concrete pavement, others on asphalt; 
some on up-grades, some downhill. 

There are two different ways by which we might go about deciding 
exactly what these varying observations showed. One way would be 
to divide up the data so that the effect of some of the different factors 



DETERMINING A FUNCTIONAL RELATION 


45 


mentioned would be removed from the results. Thus if we separated 
the observations into different groups according to the make of car, 
and then re-sorted each of these groups according to the model or
the year made, the relation between speed and distance for any 
single group would no longer be affected by differences in braking 
equipment so far as engineering design went. Most of the remaining 
factors, however, would still be present to affect the results, so that 
even within each subdivision the records would still show great 
diversity in the relation. Only if we continued the process of sub- 
division of our sample until we got down to successive observations 
of a single car operated by a single driver at the same place, would we 
be likely to get observations as consistent with each other as those in 
the previous physical and chemical illustrations. Differences in the 
promptness with which the driver responded to the signal, in the pre- 
ciseness with which the speed at the moment of giving the signal was 
observed, and possibly in the force with which the driver applied his 
brakes, all might influence the result, so that even then the results
might be less consistent (“the curve be less definitely defined”)
than in a series of laboratory experiments where all the important 
outside variables could be definitely controlled and so prevented from 
affecting the results obtained. 

Should the entire mass of observations be analyzed as suggested, 
that would give a great number of different sets of relations, each one 
showing how long it took a given car to stop when driven by a given 
driver, when traveling at different speeds. But this great number of 
different curves might not be suitable to answer our question. They 
might be so different from curve to curve that it might seem that there 
was no real general relation between speed and distance. A new car, 
with four-wheel brakes, driven by an experienced driver, might stop 
in its own length at the same speed at which an old car, with brakes 
nearly worn out, and driven by an inexpert driver, might require a 
hundred feet or more. Obviously neither one of these extremes would
be typical of the general relation; but what would be typical? Even 
the less extreme cases might show great variations among themselves, 
so that it would be almost impossible to pick from the great diversity 
of curves one or a few that would serve as a basis of judgment for our 
problem.

A second way of going about it would be to try to determine some 
sort of average relation between speed and distance. In that case we 
should admit that there were great differences from the average in 
individual cases, yet should feel that the average would serve as a 



THE MEANING OF FUNCTION 


general indication of what the relation was, even though we were aware 
it would not be true in every, or perhaps even in any, individual 
case. If we knew nothing about a car except the speed at which it 
was moving, that average relation, however, would serve to give us 
the best guess we could make as to how far it would take it to stop. 
Since we should have to make our speed limits the same for all pas- 
senger cars, that might give us the best basis of judgment as to how 
high it was safe to place it. Of course we should also need to know 
something about how much more than the average time exceptional 
cars or drivers might require and how far above the average any large 
proportion of them fell, so as to decide how much leeway to allow; 
but even so, the average relation would be the first interest and the 
point of departure in reaching our decision. 

Where the relation between two variables is clear and reasonably 
sharply defined, as in the experimental case discussed, it is not difficult 
to determine the average relationship, since the relation for individual 
cases and the average relation for all cases are nearly identical. Where 
the relation is not so well defined, however, and where many other 
relations are involved in addition to the particular one which is being 
studied, it is by no means so easy to determine exactly what the 
true relationship is. A considerable body of statistical methods has 
therefore been developed to treat this particular problem. Since this 
problem pertains to the relation between variables, it has become 
known as the problem of co-relation, or “correlation.” Just how
statistical technique may be applied to the solution of the traffic prob- 
lem which has just been presented will be considered in detail in the 
next chapter. 

Summary. A statement of the change in one variable which accompanies
specified changes in another is known as a statement of
a functional relation. A functional relation may be stated either
graphically by a curve or algebraically by a definite equation. Al- 
though functional relations may be readily determined from experi- 
mental conclusions where all influences except the one being studied 
are held constant, many problems cannot be studied by such methods. 
The statistical methods of correlation analysis may be used to study 
functional relations where experimental methods are not satisfactory.



CHAPTER 4 


DETERMINING THE WAY ONE VARIABLE CHANGES WHEN 
ANOTHER CHANGES: (1) BY THE USE OF AVERAGES 

The problem stated in the previous chapter was to determine how 
many feet automobiles traveling at a given speed require to stop. It 
involves determining the average extent to which one variable changes 
when another variable changes. Stated mathematically, the problem is 
to find the functional relation between speed and distance, that is, the 
probable distance required to stop with any given initial speed. Of the 
many different ways of doing this, the simplest, and the one which 
would suggest itself most naturally, would be to classify the records 
into groups, placing all of one speed in one group, all of another speed 
in another group, making as many groups as there are different rates 
of speed recorded, and then averaging the different distances for all 
the cases in each group. This would then give an average distance 
to stop for each given rate of speed in the series of records. Table 11 
shows this operation carried out. 

Where there were only single observations, this fact has been 
indicated by placing the average (the single report) in parentheses. 

The averages in the last column of Table 11 show quite specifically 
how the distance required to stop tends to increase with the speed a 
machine is traveling. The machines which were tested at 12 miles 
per hour stopped at an average distance of 21.5 feet, those at 15 miles 
per hour at 33.3 feet on the average, and those at 20 miles per hour at 
50.4 feet. But the increase is not uniform. The cars at 10 miles 
per hour averaged a greater distance than those at either 11 or 12, 
and the cars at 19, a shorter distance than those at 18. 
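
The averaging by unit groups described above can be retraced with a short computation. The following is only a sketch: the figures are copied from Table 11, while the names `TABLE_11` and `group_averages` are ours and are introduced purely for illustration.

```python
from statistics import mean

# Distances (feet) noted at each speed (miles per hour), copied from Table 11.
TABLE_11 = {
    4: [2, 10], 7: [4, 22], 8: [16], 9: [10],
    10: [26, 34, 18], 11: [28, 17], 12: [20, 24, 28, 14],
    13: [34, 26, 34, 46], 14: [36, 26, 60, 80], 15: [54, 26, 20],
    16: [32, 40], 17: [50, 40, 32], 18: [56, 84, 76, 42],
    19: [68, 46, 36], 20: [48, 56, 64, 52, 32], 22: [66], 23: [54],
    24: [93, 70, 120, 92], 25: [85],
}


def group_averages(table):
    """Return {speed: average distance} for every speed in the table."""
    return {speed: mean(distances) for speed, distances in table.items()}


averages = group_averages(TABLE_11)
speeds = sorted(averages)
for speed in speeds:
    n = len(TABLE_11[speed])
    print(f"{speed:>2} mph: {averages[speed]:6.2f} feet  (n = {n})")

# Speeds at which the average distance is lower than at the next slower speed.
reversals = [s for prev, s in zip(speeds, speeds[1:])
             if averages[s] < averages[prev]]
print("average falls at:", reversals)
```

The list of reversals makes the irregularity plain: with so few reports in each group, the average distance falls at several speeds where it should rise.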

If the successive averages from Table 11 are plotted and con- 
nected by lines, both the general increasing tendency and the irregular 
change from group to group are easily seen. Figure 5 shows this 
comparison (see page 49). 

Do these differences between the different group averages have any 
real significance? Is there any reason to think that this very jagged 
line is the true average relation between speed and distance? We 
can consider that from two points of view; the logic of the relation 
and the statistical basis of the differences. Logically the differences 
are quite nonsensical. If a given machine can stop in 22 feet when 
it is going 11 miles an hour, of course it can stop in at least the same 
distance when going 10 miles per hour, and probably something less. 

TABLE 11 

Computation of Average Distance to Stop after Signal, for Different 
Initial Speeds 

Speed when signal    Different distances       Average distance 
is given             noted for that speed*     for that speed 
(miles per hour)     (feet)                    (feet) 

        4            2, 10                        6.0 
        7            4, 22                       13.0 
        8            16                         (16) 
        9            10                         (10) 
       10            26, 34, 18                  26.0 
       11            28, 17                      22.5 
       12            20, 24, 28, 14              21.5 
       13            34, 26, 34, 46              35.0 
       14            36, 26, 60, 80              50.5 
       15            54, 26, 20                  33.3 
       16            32, 40                      36.0 
       17            50, 40, 32                  40.7 
       18            56, 84, 76, 42              64.5 
       19            68, 46, 36                  50.0 
       20            48, 56, 64, 52, 32          50.4 
       22            66                         (66) 
       23            54                         (54) 
       24            93, 70, 120, 92             93.75 
       25            85                         (85) 

* Data taken from Table 10. 


It certainly would not take 26 feet, as the table shows. Then from 
the statistical point of view the groups are entirely too small to show 
very definitely how far on the average it takes to stop at any one 
speed. Even the largest group, at 20 miles per hour, has only 5 cases, 
whereas we have seen in Chapter 2 that 10 to 25 cases may be required 
as a minimum to give an average of much reliability. Computing the 
standard error for the average from the 20-mile group of reports, it 
comes out 5.3 feet. With only 5 reports, however, Figure A (in Appendix 3) 
shows that we have to take a range of 1.1 times the standard 
error to make the observed value come within that range of the true 
value in 2 samples out of 3. We may say that the standard error of the 
average, taking this into account, is 5.83 feet.¹ The average for this 
group of records may therefore be written 50.4 ± 5.8 feet. When we 
say that the average distance required to stop when traveling 20 miles 
per hour (for all automobiles in town, say) is between 44.6 feet and 
56.2 feet, we are likely to be wrong in 1 out of 3 such statements, on 
the average. With the average from the largest group showing as 
little reliability as this, it is quite clear that the zigzag variation 
from average to average has no real meaning. So few cases are 
included in each group that the averages are not statistically reliable 
to anything like the individual differences. All the irregular differences 
from group to group can therefore be accounted for by purely 
chance variations in sampling. It is quite possible that they are due 
solely to the small number of cases. As they have no statistical 
significance there is therefore no need to be worried about them. 

Fig. 5. Relation of speed of automobile to distance it takes to stop, as shown by 
averages of small groups. (Horizontal axis: speed of auto, in miles per hour.) 

¹ The standard error is computed from the standard deviation of the five reports 
at 20 miles, using equation (7.1). This gives a value of 5.3. Figure A, in 
Appendix 3, shows that for five reports a range of 1.1 times the computed standard 
error must be taken to secure a reliability of .67 (or probability of .33 for the 
specified departure), so the final standard error is (5.3)(1.1), or 5.83. 

Does that mean that in spite of the relationship we can see in 
Figure 5 that we can get no accurate statistical measurement of the 
relation? That is overstating the case a little; all that we have 
determined so far is that the line of averages, the irregular function 
shown in Figure 5, has but little statistical meaning, just as it stands 
now. 
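
The standard error quoted above for the 20-mile group can be checked with a few lines of arithmetic. This is a sketch under one stated assumption: equation (7.1) is not reproduced in this chapter, so it is taken here to be the standard deviation divided by the square root of n - 1, which is the form that yields the 5.3 feet reported. The factor 1.1 is the reading from Figure A cited in the text; the variable names are ours.

```python
from math import sqrt

reports_20mph = [48, 56, 64, 52, 32]   # feet to stop at 20 mph (Table 11)
n = len(reports_20mph)
average = sum(reports_20mph) / n

# Standard deviation about the group average, dividing by n.
sigma = sqrt(sum((d - average) ** 2 for d in reports_20mph) / n)

# ASSUMPTION: equation (7.1) is taken to be sigma / sqrt(n - 1); it is not
# shown in this chapter, but this form gives the 5.3 feet quoted in the text.
standard_error = sigma / sqrt(n - 1)

SMALL_SAMPLE_FACTOR = 1.1   # from Figure A, Appendix 3, for 5 reports
# The text rounds the standard error to 5.3 before applying the factor.
adjusted = round(standard_error, 1) * SMALL_SAMPLE_FACTOR

low = average - round(adjusted, 1)
high = average + round(adjusted, 1)

print(f"average                 = {average:.1f} feet")
print(f"computed standard error = {standard_error:.1f} feet")
print(f"adjusted standard error = {adjusted:.2f} feet")
print(f"two-in-three range      = {low:.1f} to {high:.1f} feet")
```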

We might be able to make the results more accurate by basing 
our averages on a larger number of reports. As we have seen previ- 
ously, the more cases there are in a group the more reliable the 
average of that group is likely to be. One way of doing that would 
be to go out and get more records, so that we should have enough 
cases in each group to make the averages reliable within small enough 
limits to suit our needs. But that would be a long and expensive 
process. Isn’t there some way we can find out something more just 
from the records we have? 

Another way of making the conclusions more stable would be by 
combining the records so as to give fewer groups, but with more cases 
in each group. So far we have been working with 19 different groups, 
one for each of the 19 different speeds measured. If instead we group 
them into a few groups, say four or five, we shall have considerably 
larger groups to work with. 

Independent and dependent variables. The question might be 
asked whether the groups should be made on the basis of the rate 
of speed or of the distance to stop. (In preparing Table 11 we used 
the rate of speed without discussing the matter.) That comes back 
to the question of what we really want to find out. Do we want to 
know the average speed at which machines were traveling when it 
took them, say, 20 feet to stop; or do we want to know the average 
distance machines took to stop when they are traveling at a given 
speed? Obviously, the thing we are going to set is the speed limit, and 
we are merely interested in the distances to stop as one factor to guide 
us in deciding what the speed limit should be. We therefore want to 
know the effect of speed upon average distance, and not the reverse. 
For that reason we shall classify our records on the basis of speed, 
and then average together all the different distances for the cars 
traveling at that speed. 

The same question is met with in nearly all problems where the 
relation between two variables is to be dealt with. It is always neces- 
sary to think over the problem carefully, and decide which variable we 
are going to regard as the independent or causal variable, and which 
one as the dependent, or resultant. Thus if we were relating varia- 
tions in tobacco yields to applications of fertilizer, obviously the 
differences in fertilizer would be the cause and the differences in 
yield the result, so we would sort our records according to the differences 
in fertilizer. Other relations may not be so clear cut. If the 
size of stores were being related to profits, it might be as logical in 
some situations to consider that the more successful men were able to 
afford the largest stores as to consider that the larger stores returned 
the greater profits. Careful consideration of the facts in each given 
case is necessary to clarify exactly what is the particular relation in- 
volved. 

As shown later (pages 113 to 121 and 450 to 451), it is frequently 
impossible to say which variable is the cause and which is the effect. 
All that can be definitely established is that the two vary together. 
Yet one may wish to regard one variable as the one whose values are 
given or known. It is then called the independent variable and 
plotted as the abscissa. The second variable will then be regarded as 
the one whose values are to be related to, or estimated from, the values 
of the known variable. It is then called the dependent variable, since 
it is treated as depending upon the given values of the independent 
variable. It is sometimes desirable in particular problems to consider 
first one variable as the independent variable and then the other one 
as independent. 

TABLE 12 

Average Relation between Speed of Car and Distance to Stop, as 
Shown by Records Thrown into Groups 

Speed when signal    Number of    Average speed       Average distance 
is given*            reports                          to stop 
(miles per hour)                  (miles per hour)    (feet) 

Under 4.6                2             4.0                  6.0 
4.6 to 9.6               4             7.8                 13.0 
9.6 to 14.6             17            12.2                 32.4 
14.6 to 19.6            15            17.1                 46.8 
19.6 and over           12            22.2                 69.3 

* 4.6 to 9.6 means 4.6 and up to, but not including, 9.6. 


Groups of larger size. To return to our automobile problem. Since 
the speeds varied up to 25 miles per hour, and we have 50 reports to 
deal with, we might try breaking them up into 5 groups and see what 
kind of averages that will give us. Using groups covering a range of 
5 miles per hour each, we can group the records and determine the 
averages for the 5 groups thus formed, getting the results shown in 
Table 12. 
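
The grouping just described can be sketched as follows. The observations are those of Table 11 and the class limits are those printed in Table 12; the helper `group_records` and the other names are hypothetical, chosen only for this illustration.

```python
from bisect import bisect_right
from statistics import mean

# Distances (feet) noted at each speed (miles per hour), copied from Table 11.
TABLE_11 = {
    4: [2, 10], 7: [4, 22], 8: [16], 9: [10],
    10: [26, 34, 18], 11: [28, 17], 12: [20, 24, 28, 14],
    13: [34, 26, 34, 46], 14: [36, 26, 60, 80], 15: [54, 26, 20],
    16: [32, 40], 17: [50, 40, 32], 18: [56, 84, 76, 42],
    19: [68, 46, 36], 20: [48, 56, 64, 52, 32], 22: [66], 23: [54],
    24: [93, 70, 120, 92], 25: [85],
}

# Class limits as printed in Table 12; each class runs from its lower limit
# up to, but not including, the next limit.
CLASS_LIMITS = [4.6, 9.6, 14.6, 19.6]


def group_records(table, limits):
    """Sort (speed, distance) records into classes by speed."""
    groups = [[] for _ in range(len(limits) + 1)]
    for speed, distances in table.items():
        index = bisect_right(limits, speed)
        groups[index].extend((speed, d) for d in distances)
    return groups


groups = group_records(TABLE_11, CLASS_LIMITS)
summary = [(len(g), mean(s for s, _ in g), mean(d for _, d in g))
           for g in groups]

for n, avg_speed, avg_distance in summary:
    print(f"n = {n:>2}   average speed = {avg_speed:5.1f} mph   "
          f"average distance = {avg_distance:5.1f} feet")
```

Because every recorded speed is a whole number of miles per hour, any limits falling between the same whole numbers would sort the records into the same five groups.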








These averages can then be plotted and connected by straight lines, 
just as were the averages in Figure 5. In constructing Figure 6, which 
shows this process, it is necessary to use the average speed as well as 
the average distance-to-stop in locating each point. This is because 
each of the average distances, as shown in Table 12, represents not one 
speed, but several different speeds thrown together. If we wish to 
compare the average distances, it seems most sensible to compare 
them on the basis of the average of the speeds which they represent. 
The circles in Figure 6 represent the several group averages plotted 
this way. The first one is located at the intersection of the lines 
for 4.0 miles per hour and 6.0 feet; the second at 7.8 miles per hour 
and 13.0 feet; and so on for the remainder. 

Fig. 6. Relation of speed of automobile to distance it takes to stop, as shown 
by averages of large groups. (Vertical axis: distance; horizontal axis: speed of 
auto, in miles per hour.) 

When the group averages of Figure 6 are connected by straight 
lines the relation between speed and distance is shown much more 
satisfactorily than it was in Figure 5. The line in the new figure 
shows a continuous relation between speed and distance. It indicates 
that, when the averages are taken from groups large enough to elimi- 
nate the effect of individual cases, the higher the speed the greater 
the distance it takes to stop. 

But on close examination even the relation shown in this last figure 
is not found fully satisfactory. If we compute the change in distance-to-stop 
for each change of 1 mile in speed, we find that the conclusions 
are somewhat erratic. Between the first two averages, the change in 
speed from 4.0 to 7.8 miles per hour, an increase of 3.8 miles per hour, 
is accompanied by a change in distance from 6.0 to 13.0 feet, or an 
increase of 7.0 feet. Between 4 and 7.8 miles per hour, therefore, 
the distance-to-stop apparently increases 1.8 feet for each increase of 1 
mile per hour in the speed of the machine. Similar computations for 
all the other groups are shown in Table 13, carrying out just the 
same process. 

TABLE 13 

Computation of Change in Distance for Each Change of One Mile in Speed, 
for Different Groups of Records 

Speed when    Average     Average       Increase    Increase      Increase in distance 
signal is     speed       distance      in speed    in distance   per 1 mile increase 
given                     to stop                                 in speed 
(miles per    (miles per  (feet)        (miles per  (feet)        (feet) 
hour)         hour)                     hour)                      

Under 5         4.0         6.0 
                                          3.8          7.0           1.8 
5 to 10         7.8        13.0 
                                          4.4         19.4           4.4 
10 to 15       12.2        32.4 
                                          4.9         14.4           2.9 
15 to 20       17.1        46.8 
                                          5.1         22.5           4.4 
20 to 25       22.2        69.3 

The results shown in Table 13 reveal that even the averages of 
Figure 6 are not altogether consistent. Between 4 and 8 miles per 
hour they indicate that the distance-to-stop increases 1.8 feet for 
each increase of 1 mile in the speed of the machine; between 8 and 12 
miles per hour the distance suddenly starts increasing 4.4 feet for 
each 1 mile per hour increase in the speed of the machine; then be- 
tween 12 and 17 miles per hour the effect of further increase on the 
speed becomes less again, averaging only 2.9 feet increase in stopping 
distance for each increase of 1 mile per hour in speed; and then, 
finally, between 17 and 22 miles per hour changes again to 4.4 increase 
in feet to stop for each 1 mile increase in the speed of the auto. 
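
The computation behind Table 13 is simple enough to retrace in a few lines. In this sketch the group averages are taken, already rounded, from Table 12; the function name `rates_of_change` is ours.

```python
# (average speed in mph, average distance in feet) for the five groups,
# as given in Table 12.
GROUP_AVERAGES = [(4.0, 6.0), (7.8, 13.0), (12.2, 32.4),
                  (17.1, 46.8), (22.2, 69.3)]


def rates_of_change(points):
    """For each pair of successive averages, return the increase in speed,
    the increase in distance, and the increase in distance per mile."""
    rates = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        rates.append((x1 - x0, y1 - y0, (y1 - y0) / (x1 - x0)))
    return rates


for dx, dy, rate in rates_of_change(GROUP_AVERAGES):
    print(f"speed +{dx:.1f} mph, distance +{dy:.1f} feet "
          f"-> {rate:.1f} feet per mile per hour")
```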

This same variability in the rate of change can be seen directly 
from Figure 6 by noting the steepness of the several portions of the 
line. Between 4 and 8 miles per hour, where there is the least average 
change in distance for each change in speed, the line has the least 
slope, that is, is the nearest horizontal. Between 8 and 12 miles, 
where the average distance to stop is much larger, the line tilts up 
abruptly; then between 12 and 17 miles per hour, where the average 
change in distance is less rapid, the line is flatter again, tilting up 
once more for the more rapid rate of change shown by the last group. 
It should be noted, too, that the slope of the line is almost exactly the 
same between the 7- and 12-mile averages, and the 17- and 22-mile 
averages, illustrating the fact that in both these intervals the increase 
in distance was the same for each mile-per-hour increase in speed. 
The irregular and zigzag character of the line in Figure 6 therefore 
shows the same vacillation in the group averages that the computa- 
tions in Table 13 show. Simply by examining this chart closely it 
would have been possible to tell about this unsatisfactory character 
of the conclusions without taking the time to calculate out the exact 
rates. 

Are the irregularities shown in Table 13 and Figure 6 of any 
significance statistically, or are they due simply to the possibilities 
of variation in using so small a sample, just as were the differences 
in Figure 5 and Table 11? Is it really true that an increase in speed 
has a larger effect upon the distance required to stop between 7 and 
12 miles per hour than between 12 and 17? 

Reliability of group averages. The answer to these questions again 
involves a consideration of the statistical basis upon which our conclusions 
are based. These last results were calculated from the average 
speed and average distance for the several groups of records; obviously 
they can be no more reliable than are those averages themselves. In 
measuring the reliability of those averages by the methods we have 
already discussed, the thing to do is to compute the standard errors 
which will tell us about how much confidence we can have in each 
figure. That means that, by calculating these statistical constants, 
we can judge at least the range within which the true average may fall, 
in two samples out of three, provided the sample is a random sample. 

The next step, therefore, is to calculate the standard error for each 
of the five averages of speed and the five averages of distance. The 
computation, which is exactly the same as that used before, based on 
equation (7.1), is shown in Table 14. 

TABLE 14 

Computation of Standard Errors for the Averages Shown in Table 12 

                Number of   Standard       Computed         Range within which     Average plus 
Group           cases, n    deviation, σ   standard error   chances are 2/3 that   range for 2/3 
                                           (equation 7.1)   average will fall*     probability† 

For speed (miles per hour) 

Under 5             2          0                                                     4.0 ± ? 
5 to 10             4          0.83           0.48              0.58                 7.8 ± 0.6 
10 to 15           17          1.39           0.35              0.36                12.2 ± 0.4 
15 to 20           15          1.41           0.38              0.40                17.1 ± 0.4 
20 and over        12          1.95           0.59              0.62                22.2 ± 0.6 

For distance (feet to stop) 

Under 5             2          4.00           4.00              7.20                 6.0 ± 7.2 
5 to 10             4          6.71           3.87              4.68                13.0 ± 4.7 
10 to 15           17         16.09           4.02              4.18                32.4 ± 4.2 
15 to 20           15         17.62           4.71              4.90                46.8 ± 4.9 
20 and over        12         23.25           7.00              7.35                69.3 ± 7.4 

* These values are obtained by adjusting the computed standard error to indicate the range 
for which the probability is only 0.33 that the true average lies outside. By interpolating in Figure 
A, Appendix 3, the necessary adjustments to be applied to the computed standard errors are found 
to be: for 2 observations, times 1.80; for 4, times 1.21; for 15 or 17, times 1.04; and for 12, times 1.06. 

† In addition to the ranges shown here, there is a further margin of uncertainty due to the standard 
error of these estimated standard errors. It ranges from 71 per cent for the smallest group to 
18 per cent for the largest. 

Comparing the several averages with their respective adjusted 
standard errors, as shown in the last column of Table 14, we find that, 
as regards speed, there is not a great chance that if we made the same 
number of observations over again and used the same grouping, we should get 
averages different enough to change the location of the points materially. 
But with regard to the distance required to stop, the averages 
are much less reliable. If we collected enough records to determine 
the several averages quite accurately, there is one chance out of three 
that we might find that the true distance for the first group was prac- 
tically nothing, or else more than 14 feet; or for the second group 
was less than 8 feet or more than 18 feet; and so on until for the last 
group it might be under 63 feet or over 77 feet.² With this wide possible 
variation in the true values, it is quite evident that the real 
facts have not yet been measured accurately enough to justify detailed 
computations of the differences in the slope of different portions of the 
line. By changing any one of the averages as much as has been indicated, 
the slope of the line would be very materially changed. 

² If the standard errors of the estimated standard errors were also taken into 
account, the zones of uncertainty would be even wider. 
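
The standard deviations and computed standard errors for distance in Table 14 can be retraced from the grouped records. This is a sketch under a stated assumption: equation (7.1) is not shown in this chapter, and it is taken here to be the standard deviation divided by the square root of n - 1, the form that agrees with the printed columns to within rounding. The further small-sample adjustment read from Figure A is not applied. All names are ours.

```python
from math import sqrt

# Distances (feet) from Table 11, sorted into the five speed groups.
GROUP_DISTANCES = {
    "Under 5": [2, 10],
    "5 to 10": [4, 22, 16, 10],
    "10 to 15": [26, 34, 18, 28, 17, 20, 24, 28, 14,
                 34, 26, 34, 46, 36, 26, 60, 80],
    "15 to 20": [54, 26, 20, 32, 40, 50, 40, 32,
                 56, 84, 76, 42, 68, 46, 36],
    "20 and over": [48, 56, 64, 52, 32, 66, 54, 93, 70, 120, 92, 85],
}


def standard_deviation(values):
    """Standard deviation about the group average, dividing by n."""
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def standard_error(values):
    """ASSUMPTION: equation (7.1) taken as sigma / sqrt(n - 1)."""
    return standard_deviation(values) / sqrt(len(values) - 1)


for group, distances in GROUP_DISTANCES.items():
    print(f"{group:<12} n = {len(distances):>2}   "
          f"sigma = {standard_deviation(distances):5.2f}   "
          f"standard error = {standard_error(distances):4.2f}")
```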

Range within which true relation may fall. The extent to which 
reliance may be placed in the relationship between the two variables 
as shown by the 50 observations which we have to deal with may be 
judged from Figure 7. Here the actual averages have been plotted, 
and lines drawn connecting them, just as before. But, in addition, 
rectangles have been drawn around each average to indicate the zone 
within which the true value would probably be found to lie if enough 
records were taken, using plus or minus the range for two chances 
out of three each way as the distance in laying off the rectangles 
from each average.³ The corners of these rectangles have then been 
connected by lines just as were the averages before. The probabilities 
now are that the line showing the true average relationship between 
speed and distance would run somewhere between these upper and 
lower boundaries, even though it might not be the particular irregular 
line of averages we have used so far. 

Fig. 7. Relation of speed of automobile to distance it takes to stop, as indicated 
by the range around group averages for which the probability is 2/3 that the true 
average is included. 

³ As the rectangles have been laid off with regard to both distance and speed, 
only in less than half the samples would the true values fall within the rectangles. 
In two out of three such samples the average speed will not differ from the true 
average speed by more than the stated amount. Similarly, in two out of three 
such samples the observed average distance will not differ from the true average 
by more than the extent calculated. Since 2/3 times 2/3 equals 4/9, only in four samples 
out of nine, on the average, would it be likely that both observed speed and distance 
would fall within the calculated ranges from the true values at the same time. 

Figure 7 indicates that there is really a rather wide zone within 
which the true relation might fall, even when we take the zone as 
indicated by statements which will be incorrect one time out of three. 
For example, it indicates that machines traveling 15 miles per hour 
would probably stop in 36 to 46 feet after the brakes were applied, 
whereas those traveling 20 miles an hour would probably stop in 52 to 
68 feet. But this is still a pretty rough measure: would increasing the 
speed from 15 to 20 miles per hour increase the distance from 46 to 52 
feet, only 6 feet; or would it increase it from 36 to 68 feet, 32 feet? 
Of and by themselves, the data do not tell us. We do not yet have any 
general statement of the relation between speed and distance. 

We have seen how increasing the number of cases included in a 
single group increased the dependence which would be placed in that 
group. However, even by reducing our 50 cases to 5 groups we have 
not been able to get a consistent and satisfactory statement of the 
relation. Is it possible that by handling all the data as a single group 
we could get a better result? One way of doing this would be to 
average all the speeds and all the distances together. But that would 
only tell us what was the average distance to stop and the average 
speed. What we want to know is what distance is most likely to be 
required at any given speed, and the treatment just suggested would 
not give us that. 
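
A brief sketch shows why averaging everything together does not answer the question: the whole body of records collapses into a single pair of figures, with nothing to say how distance changes with speed. The data are those of Table 11; the names are ours.

```python
# Distances (feet) noted at each speed (miles per hour), copied from Table 11.
TABLE_11 = {
    4: [2, 10], 7: [4, 22], 8: [16], 9: [10],
    10: [26, 34, 18], 11: [28, 17], 12: [20, 24, 28, 14],
    13: [34, 26, 34, 46], 14: [36, 26, 60, 80], 15: [54, 26, 20],
    16: [32, 40], 17: [50, 40, 32], 18: [56, 84, 76, 42],
    19: [68, 46, 36], 20: [48, 56, 64, 52, 32], 22: [66], 23: [54],
    24: [93, 70, 120, 92], 25: [85],
}

# One (speed, distance) record per report.
records = [(speed, d) for speed, ds in TABLE_11.items() for d in ds]

average_speed = sum(s for s, _ in records) / len(records)
average_distance = sum(d for _, d in records) / len(records)

print(f"{len(records)} reports")
print(f"average speed    = {average_speed:.1f} miles per hour")
print(f"average distance = {average_distance:.2f} feet")
```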

There is one way, though, of determining the relation while considering 
all the records together. If we are willing to assume that 
an increase of one mile per hour in the rate of speed will increase 
the distance required to stop by exactly the same number of feet, 
no matter how rapidly or how slowly the machine is already moving, 
then we can determine this relation for all the data as a whole. On this 
basis a straight line can be used to represent the relation. All that we 
have to do is to determine a straight line which will come as near as 
possible to representing the relation as shown by all 50 individual 
observations. 

Summary. The change in one variable with changes in another 
may be approximately determined by grouping the records according 
to the independent variable and determining the corresponding averages 
for the dependent variable. Unless a very large number of 
observations is available, however, the functional relation shown by 
the successive averages will be irregular and inconsistent, owing 
solely to sampling variability. For that reason some method is needed 
for measuring the functional relation for the group of records as a 
whole. The simplest way in which this can be done is by assuming 
that the relation can be represented by a continuous straight line. 
Methods of determining such a line will be considered in the next 
chapter. 

Note 1, Chapter 4. As already noted earlier in this chapter, it is always pos- 
sible to reverse the dependent and the independent variables. Thus the data 
presented in Figure 3, on page 38, might have been plotted with time as the inde- 
pendent variable and with distance fallen as the dependent. A curve might then 
have been drawn in to show the average distance which a body can traverse for a 
given time of fall. Similarly, the data charted in Figure 4, on page 44, might 
have been charted with distance as the abscissa and speed as the ordinate. The 
data would then be in shape to consider the question, what is the average speed 
of cars which require a given specified distance to stop? The functions which 
express these relations are not exactly the reciprocal of the functions which express 
the reverse relation. That is, when 

Y = f(X) 

and 

X = φ(Y) 

then 

f(X) ≠ 1/φ(Y) 


The reasons for this will be considered subsequently. 



CHAPTER 5 


DETERMINING THE WAY ONE VARIABLE CHANGES WITH 
ANOTHER: (2) ACCORDING TO THE STRAIGHT- 
LINE FUNCTION 

There are a good many ways by which a straight line can be deter- 
mined to show the functional relation between the two variables, 
speed and distance. One way would be simply to place a ruler over the 
chart along the several group averages, or to stretch a black thread 
over them, and draw the line in by eye so as to fall as nearly as pos- 
sible along them. Although no two persons would draw their lines ex- 
actly the same, still this method might give fairly satisfactory results 
where only a rough measure was wanted. In the present case, how- 
ever, in view of the expensive field work necessary to collect the data, 
it would seem worth while to put as much clerical time on analyzing 
those we have as is needed to give the most accurate results. We shall 
therefore use the exact correlation method of determining the straight 
line. 

The equation of a straight line. The determination of what 
this line will be consists in finding the constants for the equation of 
the line. Just as we have already seen (Chapter 3) that the curve 
showing the relation between the distance a body has to fall and the 
time it takes can be expressed by the relation, 

so any straight line can be expressed by the relation¹ 

Y = a + bX (8) 

¹ Written this way, the equation is a perfectly general one which can be applied 
to the relation between any two variables, by calling one of them Y and the other 
one X. The symbol Y in the equation simply represents the number of units of 
the variable we designate as Y, whatever that may be, acres, dollars, pounds; and 
the symbol X likewise represents the number of units of the variable we designate 
as X. Thus if X is the number of rooms in each of a series of houses, X may be 4 
for the first house, 7 for the next, 6 for the next, and so on. When we write X we 
then mean the number of rooms in each house, no matter how large or how small 
that number may be in any particular case. The particular number which X represents 
in any given case is said to be the value of X. Thus for a house of 5 
rooms, we should say “the value of X is 5.” 



Figure 8 illustrates the meaning of a and b in this formula. 
When the value of X is 0, b times X is zero, and Y is equal to a. 
This constant, a, therefore, gives the height of the line (in terms of 
Y or vertical units) at the point where X is zero. This is indicated 
at the left edge of the chart. 

From the same equation, every time X increases one unit, Y 
increases b times one unit, since Y is computed as a plus b times X. 
The difference of the height of the line (measured in Y units) between 
the point where X is 1 and where X is 2, is therefore b units of Y, just 
as indicated on the chart. And this continues to hold true for every 
unit change in X, whether from 1 to 2, or from 0 to 1, or from 
99 to 100. 

Fig. 8. Graph of the function Y = a + bX. 

The meaning of these constants in the equation of the straight 
line, as equation (8) is known, may be illustrated more concretely 
by taking some actual values for the constants a and b, and seeing 
how the line would look then. If we take 3 for a, and 2 for b, the 
equation would then read: 

Y = 3 + 2X 

Figure 9 shows the line for which this is the equation. Thus if 
X is taken as zero, the value of Y is found to be 

Y = 3 + (2 times 0) = 3 + 0 = 3 

And 3 is therefore the Y value corresponding to the X value, zero. 
Similarly if X is taken as 10, 

Y = 3 + (2 times 10) = 3 + 20 = 23 





And the Y value corresponding to the X value of 10 is therefore 23. 
All other values of Y which may be computed for values of X within 
the range shown in Figure 9 will similarly be found to lie exactly 
on the same line. 
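
The same arithmetic can be put in a short sketch, which may help in seeing that a is the height of the line where X is zero and b the rise for each unit of X. The function name `straight_line` is ours, introduced only for illustration.

```python
def straight_line(a, b):
    """Return the function Y = a + bX for the given constants a and b."""
    def y(x):
        return a + b * x
    return y


y = straight_line(3, 2)   # the line Y = 3 + 2X of the illustration

print("X =  0, Y =", y(0))    # height of the line where X is zero, i.e. a
print("X = 10, Y =", y(10))

# The rise for one unit of X is b, wherever along the line it is measured.
for x in (0, 1, 5, 99):
    print(f"from X = {x} to X = {x + 1}, Y rises by {y(x + 1) - y(x)}")
```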

Figure 9 illustrates again the meaning of the constants a and b. 
When X is zero, the value of Y is three units above zero, as indicated, 
and for every unit increase in X (say from 5 to 6) the value of Y goes 
up 2 units. This is exactly the same thing as shown in Figure 8, except 
that there no definite values were assigned to a and b, whereas here 
they have been given exact numerical values. 

Fig. 9. Graph of the function Y = 3 + 2X. 

To represent the general relation between the speed of an automobile 
and the distance it takes to stop, therefore, we can use this 
same kind of equation, letting X stand for the speed in miles per 
hour and Y stand for the distance-to-stop in feet. 

Thus when we write the equation: 

Y = a + bX 


we shall be using that as shorthand for 

Feet to stop = a + b (speed in miles per hour) 

But to give this equation definite meaning we must determine the 
numerical values for a and b, just as in our previous illustration we 
had to assume numerical values for these constants before the graph 
had any definite meaning for us. 

The "observation equations." One way of finding what the values 
should be is by regarding each one of our original observations (Table 
10) as an algebraic equation itself. Thus the first observation, 2 feet 
to stop at 4 miles per hour, would be written 

2 = a + b(4) 

putting the 2 feet in place of Y in the equation and the 4 miles in 
place of X. 




Similarly the next observation, 4 feet to stop at 7 miles per hour, 
would be expressed 

4 = a + b(7) 

and so on right through to the last observation, 32 feet to stop at 
20 miles per hour, which would be written 

32 = a + b(20) 

Bringing all these different equations together would give a series 
looking like this: 

 2 = a + 4b 
 4 = a + 7b 
50 = a + 17b 
 . . . . . . 
80 = a + 14b 
32 = a + 20b 

(The middle equations are omitted here to save space.) 

Since we had 50 original observations, we should have 50 different 
equations, each one containing the two unknown constants a and 6. 

Now by the rules of simple algebra, any two independent equations 
containing two unknown constants can be solved simultaneously to 
obtain the numerical values for those constants. One way to find 
the values of our unknown a and b would be to pick two of the equa- 
tions representing our observations and solve them simultaneously. 
Suppose we take the first and the last ones; we shall then have: 

a +  4b = 2
a + 20b = 32

Solving these two equations simultaneously, we find the values

a = -5 1/2
b = 1 7/8

But in getting these values we have used only 2 out of the 50
observations. Should we have got the same result if we had used
another pair? Suppose we take the second observation and the next to
the last. Then

a +  7b = 4
a + 14b = 80




These equations, solved simultaneously, give the values 

a = -72
b = 10 6/7

which are certainly far different from those secured before. Appar- 
ently the values secured by this method would depend upon the 
particular pair of observations selected, perhaps varying with each 
pair. 
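The two simultaneous solutions above can be checked with a short sketch. The helper name `line_through` is an illustrative assumption, not something taken from the text; exact fractions are used so that the results can be compared directly with the fractional values given.

```python
from fractions import Fraction


def line_through(p, q):
    """Return (a, b) for the line Y = a + bX passing through two observations."""
    (x1, y1), (x2, y2) = p, q
    b = Fraction(y2 - y1, x2 - x1)  # slope: change in Y per unit change in X
    a = y1 - b * x1                 # intercept: value of Y when X = 0
    return a, b


# First and last observations: (4 mph, 2 ft) and (20 mph, 32 ft).
first_last = line_through((4, 2), (20, 32))
# Second and next-to-last observations: (7 mph, 4 ft) and (14 mph, 80 ft).
second_next_to_last = line_through((7, 4), (14, 80))

print("first and last:          a =", first_last[0], " b =", first_last[1])
print("second and next-to-last: a =", second_next_to_last[0], " b =", second_next_to_last[1])
```

The two pairs give quite different constants, which is the point of the passage: a line fitted to two observations depends entirely on which two are chosen.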

If we work out estimated values for Y for given values of X by 
these two solutions, we get estimates as follows: 

According to the first result, 

Y = -5.5 + 1.875X

when X = 10, Y = 13.25; when X = 20, Y = 32

According to the second result,

Y = -72 + 10.86X

when X = 10, Y = 36.6; when X = 15, Y = 90.9

If we should then plot the two calculated points for the first of 
these equations, and connect them by a straight line, we should find
that that line also passes through the two dots which represent the two 
observations from which the values were calculated. Similarly, if 
we should plot the two computed points for the second equation, and 
pass a straight line through them, that also would pass through the 
two dots which represent the values from which it was calculated. 
Clearly, therefore, fitting a line to two observations is merely deter- 
mining the line that passes through them. We could compute as 
many different lines as there are different pairs of observations not 
lying on the same line. 

Fitting a straight line to two points, as we have done here, is
simply equivalent to drawing a line to pass through those two points. 
This is evident in Figure 9A. Here the dot chart shown originally as 
Figure 4 has been replotted. The dots used in computing the above 
equations have been designated by crosses. The two lines computed 
have been plotted in. Quite clearly no single line could pass through 
all the different points. If we computed more lines by this process 
of using selected pairs of points, we should just get a larger variety of 
different lines. 
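To see how large that variety of lines would be, the following sketch fits a line to every pair of observations taken at different speeds. The data are listed in the order of the observations, with Y values chosen to agree with the XY products and column totals of Table 15; the variable names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations

# Speed in miles per hour (X) and distance-to-stop in feet (Y), 50 observations.
SPEED = [4, 7, 17, 14, 12, 11, 20, 15, 17, 13, 15, 19, 10, 18, 22, 18, 8,
         4, 12, 20, 23, 18, 12, 16, 18, 19, 24, 14, 12, 9, 10, 15, 24, 25,
         20, 19, 13, 10, 7, 16, 14, 20, 24, 24, 17, 13, 11, 13, 14, 20]
DIST = [2, 4, 50, 36, 20, 28, 48, 54, 40, 34, 26, 68, 26, 56, 66, 84, 16,
        10, 14, 56, 54, 76, 24, 32, 42, 46, 93, 26, 28, 10, 34, 20, 70, 85,
        64, 36, 26, 18, 22, 40, 60, 52, 120, 92, 32, 34, 17, 46, 80, 32]

lines = []  # one (a, b) pair for each usable pair of observations
for (x1, y1), (x2, y2) in combinations(zip(SPEED, DIST), 2):
    if x1 == x2:
        continue  # same speed: no line of the form Y = a + bX passes through both
    b = Fraction(y2 - y1, x2 - x1)
    a = y1 - b * x1
    lines.append((a, b))

slopes = [float(b) for _, b in lines]
print("pairs that determine a line:", len(lines))
print("distinct lines among them:  ", len(set(lines)))
print("flattest and steepest slope:", min(slopes), max(slopes))
```

Some pairs give a downward-sloping line and others a very steep one, so no single pair can be trusted to represent the whole set.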





Fitting the line by “least squares.” If we are going to use a
mathematically determined straight line at all, what we need is one
which represents all 50 observations instead of any particular pair
of them. No one line can exactly fit all 50 observations, for, as we
have just seen, the line which would agree with the first and the last
would not agree at all with the second and next to the last. What
we shall have to find is some compromise line which will come as near
as possible to agreeing with all the 50 observation equations, even
though it does not exactly agree with any one. Mathematicians have
worked out a method of obtaining such a line by the use of what is
known as the “method of least squares.” Although the process of
determining the values of the constants a and b by this method is
somewhat complicated, it takes all the observations into account, and
gives each one of them an equal weight in the process. It is therefore
of very great value in handling problems of this sort.

Fig. 9A. Data for automobile problem, and straight lines fitted to pairs of
individual observations.

The equations upon which the process is based are derived by 
the use of calculus, and their derivation is given in Note 2, Appendix 2.
The method itself, however, is very simple and can be used by anyone 
having a knowledge of simple algebra. 

Computing the extensions. The individual observations are first
listed as shown in Table 15. The speed in miles per hour is placed
under the heading X, and the distance-to-stop in feet is placed
under the heading Y. Then each X item is squared, and entered
in the column headed X², and each X item is multiplied by the
accompanying Y item, and entered in the column headed XY.
Then all the items in each column are summed, giving the totals at
the foot of each column. Just as before, in computing the standard
deviation, we shall use the symbol “ΣX” to represent the sum of all
the X items; “ΣY” to represent the sum of all the Y items;
“Σ(X²)” to represent the sum of all the X² items; and similarly,
we shall use “Σ(XY)” to represent the sum of all the products in
the XY column.

TABLE 15

Computation of Values for Determination of Line by Least Squares

Speed in miles   Distance to stop
 per hour, X        in feet, Y          X²          XY

       4                 2              16           8
       7                 4              49          28
      17                50             289         850
      14                36             196         504
      12                20             144         240
      11                28             121         308
      20                48             400         960
      15                54             225         810
      17                40             289         680
      13                34             169         442
      15                26             225         390
      19                68             361        1292
      10                26             100         260
      18                56             324        1008
      22                66             484        1452
      18                84             324        1512
       8                16              64         128
       4                10              16          40
      12                14             144         168
      20                56             400        1120
      23                54             529        1242
      18                76             324        1368
      12                24             144         288
      16                32             256         512
      18                42             324         756
      19                46             361         874
      24                93             576        2232
      14                26             196         364
      12                28             144         336
       9                10              81          90
      10                34             100         340
      15                20             225         300
      24                70             576        1680
      25                85             625        2125
      20                64             400        1280
      19                36             361         684
      13                26             169         338
      10                18             100         180
       7                22              49         154
      16                40             256         640
      14                60             196         840
      20                52             400        1040
      24               120             576        2880
      24                92             576        2208
      17                32             289         544
      13                34             169         442
      11                17             121         187
      13                46             169         598
      14                80             196        1120
      20                32             400         640

Totals:  770 = ΣX;  2,149 = ΣY;  13,228 = Σ(X²);  38,482 = Σ(XY)
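The extensions and column totals of Table 15 can be reproduced with a few lines of code. This is only a sketch of the arithmetic described above; the list and dictionary names are assumptions, and the Y values are those that agree with the XY products and totals of the table.

```python
# Speed in miles per hour (X) and distance-to-stop in feet (Y), in table order.
SPEED = [4, 7, 17, 14, 12, 11, 20, 15, 17, 13, 15, 19, 10, 18, 22, 18, 8,
         4, 12, 20, 23, 18, 12, 16, 18, 19, 24, 14, 12, 9, 10, 15, 24, 25,
         20, 19, 13, 10, 7, 16, 14, 20, 24, 24, 17, 13, 11, 13, 14, 20]
DIST = [2, 4, 50, 36, 20, 28, 48, 54, 40, 34, 26, 68, 26, 56, 66, 84, 16,
        10, 14, 56, 54, 76, 24, 32, 42, 46, 93, 26, 28, 10, 34, 20, 70, 85,
        64, 36, 26, 18, 22, 40, 60, 52, 120, 92, 32, 34, 17, 46, 80, 32]

# The two "extension" columns: each X squared, and each X times its Y.
x_squared = [x * x for x in SPEED]
xy = [x * y for x, y in zip(SPEED, DIST)]

totals = {
    "sum_x": sum(SPEED),
    "sum_y": sum(DIST),
    "sum_x2": sum(x_squared),
    "sum_xy": sum(xy),
}
print(totals)
```

Checking the totals in this way is also a convenient test that no item has been copied or multiplied incorrectly.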

Solving the equations. Having obtained these values as indicated
in Table 15, we can next proceed to find the values of a and b by
the aid of the following formulas:

$$b = \frac{\Sigma(XY) - n M_x M_y}{\Sigma(X^2) - n (M_x)^2} \tag{9}$$

$$a = M_y - b M_x \tag{10}$$

In using these formulas the value of b is determined first, then
it is used in the next formula to determine the value of a.*

$$M_x = \frac{\Sigma X}{n} = \frac{770}{50} = 15.4$$

$$M_y = \frac{\Sigma Y}{n} = \frac{2149}{50} = 42.98$$
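Formulas (9) and (10) can be applied to the column totals of Table 15 as in the sketch below. The variable names are assumptions. One detail worth noticing: when b is carried at full precision the intercept comes out near -17.58, whereas rounding b to 3.93 before computing a, as is natural in hand computation, gives -17.54.

```python
# Column totals from Table 15.
n = 50
sum_x, sum_y, sum_x2, sum_xy = 770, 2149, 13228, 38482

mean_x = sum_x / n  # M_x
mean_y = sum_y / n  # M_y

# Formula (9): the slope b, computed first.
b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
# Formula (10): the intercept a, using the b just found.
a = mean_y - b * mean_x

# Intercept obtained when b is rounded to two decimals before use.
a_hand = mean_y - round(b, 2) * mean_x

print("M_x =", mean_x, " M_y =", mean_y)
print("b =", round(b, 4))
print("a with unrounded b =", round(a, 2))
print("a with b rounded to 3.93 =", round(a_hand, 2))
```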



* It should be noted that if both X and Y had been stated in terms of deviation
from their mean values (just as was done when the standard deviation, σ, was com-
puted in Table 6), they would have been denoted by the symbols small x and small
y. If the product shown in the fourth column of Table 15 had then been obtained
by multiplying together these two values, it would have been designated xy, and
its sum, Σ(xy). The correction factors used in the first part of the formula (9)
just given are used simply to change the product sum of the original observations,
Σ(XY), to what it would have been if it had been computed from the deviations
from the mean instead. That is to say,

$$\Sigma(XY) - n M_x M_y = \Sigma(xy) \tag{11}$$

Similarly,

$$\Sigma(X^2) - n (M_x)^2 = \Sigma(x^2)$$

Hence

$$b = \Sigma(xy) / \Sigma(x^2)$$

Equations (9) and (10) are only another way of stating the “normal equations,”
which can be solved simultaneously to give the values for a and b. These equations
are

$$\begin{aligned} n a + (\Sigma X)\, b &= \Sigma Y \\ (\Sigma X)\, a + (\Sigma X^2)\, b &= \Sigma XY \end{aligned}$$

These two equations can be solved simultaneously to get the values for a and
b which will best fit all the equations, in the same way that the previous paired
observations were put into simultaneous equations and solved simultaneously to
get the values which would exactly fit the two observations.

The method by which this line is fitted rests upon the assumption that the
scatter of the individual observations around the fitted line will approximate a
normal distribution. If one or two observations are exceedingly erratic as com-
pared to the others, so that the scatter of the observations around the line will
be very skew, this method of fitting may be unsatisfactory.

Using the values for ΣX, ΣY, Σ(X²), and Σ(XY) given in Table
15, in equations 9 and 10, we find the values of b and a to be:

$$b = \frac{\Sigma(XY) - n M_x M_y}{\Sigma(X^2) - n (M_x)^2} = \frac{38{,}482 - 50(15.4)(42.98)}{13{,}228 - 50(15.4)(15.4)} = \frac{5{,}387.4}{1{,}370} = 3.93$$

$$a = M_y - b M_x = 42.98 - (3.93)(15.4) = -17.54$$

The equation for the straight line, as thus determined by all the
observations, is therefore

Y = -17.54 + 3.93X

(For an exercise, plot this line in on the dot chart shown in Figure 4,
on page 44.)

This line is called the line of best fit, since it is the line which
gives, for all the 50 observed values of X, values of Y which come
as near as possible to agreeing with all the different Y values observed.
While some equations, such as the two computed from 2 observations
each, would come closer than would this one for some individual cases,
they would be much farther off for other cases; this one comes closer
to agreeing with all the cases than any other straight line.†

† The way in which this equation gives the best fit may be explained mathe-
matically. If the differences between each of the actual observations and the
estimated values given by this equation are computed, squared, and summed,
that sum will be smaller than it would be if any other straight line were used.
Since this method determines the line with the smallest possible squared devia-
tions, the line is known as the “least-squares” line, and the method of com-
puting it is known as the “method of least squares.”

Estimating Y from X. We can see just how the equation for
this line works by taking any given value for X we wish and working
out what the estimated value for Y would be. That is, we can take
any initial speed we wish and compute from the equation what would
be the most probable distance required to stop, on the basis of the
straight-line relationship.

If 14 miles per hour is taken, X will be 14. Substituting this
value in the equation gives the estimated value of Y.

Y = -17.54 + 3.93 (14)
  = -17.54 + 55.02
  = 37.48

So the number of feet which would probably be required to stop,
when traveling at 14 miles per hour, would be about 37.5 feet. Com-
paring this with the original observations, we see that the 4 cars
recorded at this speed stopped in 36, 26, 60, and 80 feet, respectively.
At 23 miles per hour the single car observed took 54 feet to stop.
What estimate will the equation give for that speed? Let us see:

Y = -17.54 + 3.93 (23)
  = -17.54 + 90.39
  = 72.85

This is much higher than the single observation. But referring 
to Figure 4 we see that that observation fell far below the general 
trend of the other observations. The straight-line equation, based 
on all the observations, thus seems to give a more reliable estimate 
of the distance which is most likely to be required to stop at any 
given speed than does any one individual observation. 
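The substitution just carried out by hand can be written as a small function. The function name `estimate` and the constant names are assumptions; the constants are the rounded values of a and b used in the text.

```python
A, B = -17.54, 3.93  # constants of the fitted line, as rounded in the text


def estimate(speed_mph):
    """Estimated distance-to-stop in feet from the straight line Y = a + bX."""
    return A + B * speed_mph


observed_at_14 = [36, 26, 60, 80]  # the four cars recorded at 14 miles per hour
observed_at_23 = [54]              # the single car recorded at 23 miles per hour

print("estimate at 14 mph:", round(estimate(14), 2), " observed:", observed_at_14)
print("estimate at 23 mph:", round(estimate(23), 2), " observed:", observed_at_23)
```

The estimate is a single figure for each speed, while the individual observations at that speed scatter above and below it.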

But how far is it true that the straight line gives the most accurate
estimate? Will it hold true for a speed of 1 mile per hour or for
a speed of 50? Let us see.

For 1 mile per hour the equation becomes:

Y = -17.54 + 3.93 (1)
  = -17.54 + 3.93
  = -13.61

For 50 miles per hour it gives:

Y = -17.54 + 3.93 (50)
  = -17.54 + 196.5
  = 178.96





Of these two results, only the latter sounds at all sensible. To 
say that a machine moving 1 mile per hour stops in minus 13.61 feet 
is saying that it stopped 13.61 feet back of where the brakes were 
applied, which is certainly nonsense. On the other hand, to say that 
a machine traveling 50 miles per hour would stop in about 179 feet 
after the brakes were applied might be quite reasonable — if we had 
any direct evidence for machines traveling at that speed. But that we 
do not have. All that we have are observations on 50 machines travel- 
ing at rates varying from 4 to 25 miles per hour. Since we have no 
observations for speeds below 4 miles per hour, we cannot expect 
our equation to be of any reliability below that point; and, since we 
have no observations of speeds above 25 miles per hour, we cannot 
be sure that our equation will give good estimates beyond that point. 

Only within the range covered by the original observations can an 
estimating equation of this type be used. 

Of the 50 observations, there were 6 below 10 miles per hour and 
only one above 24, so 43 out of the 50 were between 10 and 24 miles 
per hour. For that reason no great reliance can be put in the equation 
below 10 miles per hour and above 24 miles per hour. Only within 
those limits where the bulk of the observations fell can the equation 
really be trusted.* For that reason the final equation, showing the
average relation between speed and distance for automobiles, should
be written:

Y = -17.54 + 3.93X, for values of X between 10 and 24

Then the application of the equation is limited to the range given,
and there is no danger of its being used to give absurd values for
speeds too low or untested values for speeds too high.
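One way to carry the stated range along with the equation is to build the limits into the estimating function, so that it refuses to extrapolate. This is a sketch under assumed names (`estimate_within_range`, `X_MIN`, `X_MAX`); raising an error is a design choice made here for illustration, not something prescribed by the text.

```python
A, B = -17.54, 3.93     # constants of the fitted line, as rounded in the text
X_MIN, X_MAX = 10, 24   # range of speeds holding the bulk of the observations


def unrestricted(speed_mph):
    """Value of the line at any speed, whether or not it is meaningful."""
    return A + B * speed_mph


def estimate_within_range(speed_mph):
    """Estimated distance-to-stop in feet, only for speeds inside the range."""
    if not X_MIN <= speed_mph <= X_MAX:
        raise ValueError(
            f"{speed_mph} mph is outside the range {X_MIN} to {X_MAX} mph "
            "covered by the bulk of the observations"
        )
    return unrestricted(speed_mph)


print("line at 1 mph: ", round(unrestricted(1), 2))   # a negative distance
print("line at 50 mph:", round(unrestricted(50), 2))  # plausible, but untested
print("estimate at 14 mph:", round(estimate_within_range(14), 2))
```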

* See pages 113 to 121 for a discussion of the type of problem in which a formula
may be used to make estimates beyond the range covered by the data. See also
Chapter 18 for formulas for estimating the standard errors for a and b.

Now that the limits of the line have been considered, it may be
well to compare it to the group averages used before, to see how this
single line, based on all the observations, compares with the irregular
line obtained when the observations were grouped. This can be done
conveniently by drawing in the line on Figure 7, which showed not
only the line of averages but also the limits within which those aver-
ages were probably correct. This comparison is shown in Figure 10.
The straight line determined by the least-squares solution has been
drawn in solidly for the range of speed in which most of the observa-
tions fell and has been dotted in for the remainder of the range.†
Comparing the straight line with the group averages and the error
limits within which they probably would fall, we see that the line
does fall within those limits in every case but one, and in that case
it just barely misses it. That shows that, so far as indicated by the
number of observations we have on which to base the results, the
straight line may serve as a more reliable indication of the general
relation than does the irregular line of the group averages.

Fig. 10. Relation of speed of automobile to distance-to-stop as indicated by
ranges around group averages and by least-squares straight line.

† This line is drawn in according to the equation by determining the Y values
for any two convenient values of X, and then drawing a straight line connecting
them. Thus if the values at the end of the bulk of the observations, 10 and 24, are
taken for X, the accompanying values for Y are found to be 21.8 and 76.8. These
Y values are then plotted opposite 10 and 24 for X; a straight line drawn connect-
ing them; and extended as a dotted line to cover the rest of the range.

The estimated distance required to stop, for each speed considered,
is shown by the corresponding ordinate of the line in Figure 10. The
estimated values may also be obtained by substituting the X value
in the equation, just as has been done for the observations at 14 miles
and at 23 miles. Carrying out this computation gives the estimated
values shown in Table 16. Subtracting the estimated distances from
the actual distances gives the residuals, or the difference between the
two values. The symbol z is used in the table to designate these differ-
ences. The average of these differences, taken without regard to sign,
is 11.6 feet; their standard deviation is 15.07 feet.‡

TABLE 16

Speed of Auto, Distance to Stop, and Distance Estimated from Speed
by Linear Equation

  Miles    Actual   Estimated  Residual  |   Miles    Actual   Estimated  Residual
per hour, distance, distance,  (Y - Y'),  | per hour, distance, distance,  (Y - Y'),
    X         Y         Y'         z      |     X         Y         Y'         z

    4         2       -1.8        3.8     |    19        46       57.1      -11.1
    7         4       10.0       -6.0     |    24        93       76.8       16.2
   17        50       49.3        0.7     |    14        26       37.5      -11.5
   14        36       37.5       -1.5     |    12        28       29.6       -1.6
   12        20       29.6       -9.6     |     9        10       17.8       -7.8
   11        28       25.7        2.3     |    10        34       21.8       12.2
   20        48       61.1      -13.1     |    15        20       41.4      -21.4
   15        54       41.4       12.6     |    24        70       76.8       -6.8
   17        40       49.3       -9.3     |    25        85       80.7        4.3
   13        34       33.6        0.4     |    20        64       61.1        2.9
   15        26       41.4      -15.4     |    19        36       57.1      -21.1
   19        68       57.1       10.9     |    13        26       33.6       -7.6
   10        26       21.8        4.2     |    10        18       21.8       -3.8
   18        56       53.2        2.8     |     7        22       10.0       12.0
   22        66       68.9       -2.9     |    16        40       45.3       -5.3
   18        84       53.2       30.8     |    14        60       37.5       22.5
    8        16       13.9        2.1     |    20        52       61.1       -9.1
    4        10       -1.8       11.8     |    24       120       76.8       43.2
   12        14       29.6      -15.6     |    24        92       76.8       15.2
   20        56       61.1       -5.1     |    17        32       49.3      -17.3
   23        54       72.9      -18.9     |    13        34       33.6        0.4
   18        76       53.2       22.8     |    11        17       25.7       -8.7
   12        24       29.6       -5.6     |    13        46       33.6       12.4
   16        32       45.3      -13.3     |    14        80       37.5       42.5
   18        42       53.2      -11.2     |    20        32       61.1      -29.1
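The residuals of Table 16 and the two summary figures quoted for them can be recomputed as in the sketch below. Names are assumptions; the standard deviation is taken here as the root of the mean squared residual (dividing by the number of observations), which is the reading that reproduces the 15.07 feet given in the text.

```python
from math import sqrt

# Speed in miles per hour (X) and distance-to-stop in feet (Y), 50 observations.
SPEED = [4, 7, 17, 14, 12, 11, 20, 15, 17, 13, 15, 19, 10, 18, 22, 18, 8,
         4, 12, 20, 23, 18, 12, 16, 18, 19, 24, 14, 12, 9, 10, 15, 24, 25,
         20, 19, 13, 10, 7, 16, 14, 20, 24, 24, 17, 13, 11, 13, 14, 20]
DIST = [2, 4, 50, 36, 20, 28, 48, 54, 40, 34, 26, 68, 26, 56, 66, 84, 16,
        10, 14, 56, 54, 76, 24, 32, 42, 46, 93, 26, 28, 10, 34, 20, 70, 85,
        64, 36, 26, 18, 22, 40, 60, 52, 120, 92, 32, 34, 17, 46, 80, 32]

A, B = -17.54, 3.93  # constants of the fitted line, as rounded in the text

estimated = [A + B * x for x in SPEED]               # Y' for each observation
residuals = [y - e for y, e in zip(DIST, estimated)]  # z = Y - Y'

n = len(residuals)
mean_absolute = sum(abs(z) for z in residuals) / n       # average without regard to sign
std_deviation = sqrt(sum(z * z for z in residuals) / n)  # root mean squared residual

print("average residual, ignoring sign:", round(mean_absolute, 1))
print("standard deviation of residuals:", round(std_deviation, 2))
```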


‡ The significance of this standard deviation of the residuals is explained on
pages 129 and 494.

Interpreting the linear equation. Just what does the line of least
squares tell us, now that we have decided it is a fairly accurate indica-
tor of stopping distances, at least within the range 10 to 24 miles?
We can answer that by trying to explain what the constants a and b
of the equation mean, that is, the values -17.54 and 3.93, which we de-
termined by least squares.

The first of these constants, a, is merely an empirical value to
place the height of the line. If the observations available and the type
of equation used were such that they could be expected to give a
sensible value for the distance to stop when X was zero (that is,
when the machine was not moving), then a would give that value,
since when X = 0, Y = a. But, of course, when a machine is not
moving, it does not take it any distance to stop, so in this case the
a has no sensible interpretation at that point. But that is to be
expected; as has been seen, the line as a whole has but little meaning
below 10 miles per hour, and none at all below 4 miles, which was
the lowest speed covered by the records. The constant a, therefore,
has no meaning of and by itself in this particular example, but merely
serves to place the height of the line as a whole for that range within
which the line does have some meaning.

The constant b, on the other hand, is always significant. It shows
the difference in Y for every difference of one unit in X, on the
average of all the observations, and within the range covered. In
this particular problem, the value of 3.93 for b indicates that between
4 and 24 miles per hour each increase of one unit in X, that is to say,
each increase of one mile per hour in speed, causes on the average an
increase of 3.93 units in Y, that is, of 3.93 feet in the distance re-
quired to stop. This interpretation of b can always be made, and
is one of the most significant results secured by determining the con-
stants for the straight line. In comparison with the values shown in
Table 13, ranging from 1.8 feet to 4.4 feet increase in stopping distance
for each one mile increase in speed, this figure of 3.93 feet per mile
increase in speed is seen as a sort of weighted average, averaging
together all the different possible sorts of comparisons like those in
Table 13.*

* The value determined for b, like the value previously determined for the
mean yield of corn, is not the true value for all the cars in the city studied, but
is only the estimate of that value as determined from the cars included in the
sample. Just as the sample mean may vary from the true mean for the universe,
so the b computed from the sample may vary from the true b for the universe.
Likewise, the possible extent of that variation may be indicated by estimating its
standard error. The increase in distance-to-stop for each additional mile in speed
should be stated as

3.93 feet ± (standard error of b)

Pages 312 to 315 show how to calculate the standard error of b and explain its
meaning more fully.
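The description of b as a sort of weighted average of individual comparisons can be made concrete in one particular way. The sketch below is not the comparison of group averages in Table 13; it relies instead on a standard property of the least-squares slope, namely that it equals the average of the slopes between all pairs of observations when each pair is weighted by the square of its difference in X. The names used are assumptions.

```python
from itertools import combinations

# Speed in miles per hour (X) and distance-to-stop in feet (Y), 50 observations.
SPEED = [4, 7, 17, 14, 12, 11, 20, 15, 17, 13, 15, 19, 10, 18, 22, 18, 8,
         4, 12, 20, 23, 18, 12, 16, 18, 19, 24, 14, 12, 9, 10, 15, 24, 25,
         20, 19, 13, 10, 7, 16, 14, 20, 24, 24, 17, 13, 11, 13, 14, 20]
DIST = [2, 4, 50, 36, 20, 28, 48, 54, 40, 34, 26, 68, 26, 56, 66, 84, 16,
        10, 14, 56, 54, 76, 24, 32, 42, 46, 93, 26, 28, 10, 34, 20, 70, 85,
        64, 36, 26, 18, 22, 40, 60, 52, 120, 92, 32, 34, 17, 46, 80, 32]

weighted_sum = 0.0
weight_total = 0.0
for (x1, y1), (x2, y2) in combinations(zip(SPEED, DIST), 2):
    dx = x2 - x1
    if dx == 0:
        continue  # pairs at the same speed carry zero weight
    pair_slope = (y2 - y1) / dx  # feet per additional mile per hour, this pair
    weight = dx * dx             # widely separated speeds count for more
    weighted_sum += weight * pair_slope
    weight_total += weight

b_from_pairs = weighted_sum / weight_total

# The same slope from the deviation form b = sum(xy) / sum(x^2).
n = len(SPEED)
mean_x, mean_y = sum(SPEED) / n, sum(DIST) / n
b_least_squares = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(SPEED, DIST))
    / sum((x - mean_x) ** 2 for x in SPEED)
)

print(round(b_from_pairs, 4), round(b_least_squares, 4))
```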





It should be noted that even though the straight line does fall 
within the standard error limits of most of the averages, as it does 
in this case, that by itself is no proof that the straight-line formula 
really expresses the true underlying relation between the speed of a 
machine and the distance that it takes it to stop in this example. 
It is a purely arbitrary method of describing relation, which ap- 
parently expresses the observed relation fairly well; but that is all. 
It is, after all, only an empirical expression of the relationship; and 
because it happens to agree fairly well is no proof that it expresses 
the true nature of the relation. In fact, there is as yet no proof that 
it is even the best empirical description of the observed relation that 
can be obtained; further tests, to be described in the next chapter, 
are necessary. 

But whether or not the straight line is the best function in this 
particular example, it is a type of relation of very great importance 
and usefulness. It is one of the simplest functions to fit and to ex- 
plain, and for that reason it is very widely used. The equations used 
in determining the constants of the equation (equations [9] and [10], 
page 66) are therefore of great importance. The student of analytical 
statistics should become thoroughly familiar with the methods of de- 
termining the constants of the equation and should understand thor- 
oughly both the meaning and the limitations of this type of analysis. 

Determining the constants for the linear equation for a given set
of observations is called “fitting” the equation to the data. Because
the linear equation is one of the simplest of all equations to “fit,” it is
widely and frequently used. In many cases, no other possible relation 
is even considered. Actually, however, the linear equation is very 
limited in its logical meaning. By its very nature, it can represent 
only a situation where the change in the dependent variable, for a 
unit change in the independent variable, would be expected to be 
just the same regardless of how large or how small the independent 
variable was. This is a very precise and narrow relation. In many 
sets of relationship, the relation which theoretically would be ex- 
pected would be a changing relationship as the value of the inde- 
pendent variable changed, instead of this unchanging relationship. 
Unless there is a good logical reason to expect the linear equation to 
represent truly the situation present, fitting a straight line can be re- 
garded only as an empirical exercise, with no meaning to the constants 
obtained beyond the purely formal one of specifying the straight 
line that most nearly represents the data. 





Summary. To express a functional relationship by a straight
line, the constants may be determined arithmetically by the “method
of least squares.” Such a line gives the “line of best fit” under the
assumptions of that method: a normal distribution of the observa- 
tions around the line and the reduction of the squared residuals to a 
minimum. Estimates of the dependent variable may be made accord- 
ing to the linear function for any value of the independent variable. 
Only within the range which includes the bulk of the independent 
values does this estimate have meaning, however; and only then if 
the straight line gives a satisfactory expression of the observed rela- 
tion, either empirically or logically. 
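The sense in which the least-squares line reduces the squared residuals to a minimum can be checked numerically. The sketch below solves the two normal equations for a and b and then compares the sum of squared residuals with that of several other straight lines, including the two lines fitted earlier to single pairs of observations. Function and variable names are assumptions.

```python
# Speed in miles per hour (X) and distance-to-stop in feet (Y), 50 observations.
SPEED = [4, 7, 17, 14, 12, 11, 20, 15, 17, 13, 15, 19, 10, 18, 22, 18, 8,
         4, 12, 20, 23, 18, 12, 16, 18, 19, 24, 14, 12, 9, 10, 15, 24, 25,
         20, 19, 13, 10, 7, 16, 14, 20, 24, 24, 17, 13, 11, 13, 14, 20]
DIST = [2, 4, 50, 36, 20, 28, 48, 54, 40, 34, 26, 68, 26, 56, 66, 84, 16,
        10, 14, 56, 54, 76, 24, 32, 42, 46, 93, 26, 28, 10, 34, 20, 70, 85,
        64, 36, 26, 18, 22, 40, 60, 52, 120, 92, 32, 34, 17, 46, 80, 32]

n = len(SPEED)
sx, sy = sum(SPEED), sum(DIST)
sxx = sum(x * x for x in SPEED)
sxy = sum(x * y for x, y in zip(SPEED, DIST))

# Solve the normal equations  n*a + sx*b = sy  and  sx*a + sxx*b = sxy.
det = n * sxx - sx * sx
a = (sy * sxx - sx * sxy) / det
b = (n * sxy - sx * sy) / det


def sum_squared_residuals(a_, b_):
    """Sum of squared differences between observed Y and a_ + b_*X."""
    return sum((y - (a_ + b_ * x)) ** 2 for x, y in zip(SPEED, DIST))


least = sum_squared_residuals(a, b)
alternatives = {
    "line through first and last observations": sum_squared_residuals(-5.5, 1.875),
    "line through second and next-to-last": sum_squared_residuals(-72, 76 / 7),
    "least-squares line raised by 1 foot": sum_squared_residuals(a + 1, b),
    "least-squares line with slope + 0.1": sum_squared_residuals(a, b + 0.1),
}

print("least-squares line:", round(least, 1))
for label, value in alternatives.items():
    print(f"{label}: {round(value, 1)}")
```

Every alternative line gives a larger sum than the least-squares line, though a comparison of this kind illustrates the property rather than proving it.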

Note 1, Chapter 5. Just as a straight line can be fitted to show the average
distance-to-stop for each given rate of speed, so another straight line can be fitted
if the variables are reversed. In that case the speed, miles per hour, could be
regarded as the dependent or Y variable, and the distance-to-stop, feet, would be
regarded as the independent or X variable. Working out the values of a and b
for this reverse statement of the problem will be left as an exercise for the stu-
dent. In line with the note to Chapter 4, it will be found that the value of this
new b is not equal to 1/b as previously determined, but will differ slightly from it.
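For readers who wish to check their answer to this exercise, the reversed fit can be sketched as follows. The helper `fit_line` is an assumed name; it simply applies formulas (9) and (10) to whichever variable is treated as X.

```python
# Speed in miles per hour and distance-to-stop in feet, 50 observations.
SPEED = [4, 7, 17, 14, 12, 11, 20, 15, 17, 13, 15, 19, 10, 18, 22, 18, 8,
         4, 12, 20, 23, 18, 12, 16, 18, 19, 24, 14, 12, 9, 10, 15, 24, 25,
         20, 19, 13, 10, 7, 16, 14, 20, 24, 24, 17, 13, 11, 13, 14, 20]
DIST = [2, 4, 50, 36, 20, 28, 48, 54, 40, 34, 26, 68, 26, 56, 66, 84, 16,
        10, 14, 56, 54, 76, 24, 32, 42, 46, 93, 26, 28, 10, 34, 20, 70, 85,
        64, 36, 26, 18, 22, 40, 60, 52, 120, 92, 32, 34, 17, 46, 80, 32]


def fit_line(xs, ys):
    """Least-squares constants (a, b) of Y = a + bX, by formulas (9) and (10)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y) / (
        sum(x * x for x in xs) - n * mean_x ** 2
    )
    return mean_y - b * mean_x, b


a_dist_on_speed, b_dist_on_speed = fit_line(SPEED, DIST)  # distance estimated from speed
a_speed_on_dist, b_speed_on_dist = fit_line(DIST, SPEED)  # speed estimated from distance

print("b, distance on speed:", round(b_dist_on_speed, 4))
print("reciprocal of that b:", round(1 / b_dist_on_speed, 4))
print("b, speed on distance:", round(b_speed_on_dist, 4))
```

The new b is smaller than the reciprocal of the old one; the two would coincide only if every observation lay exactly on a single straight line.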



CHAPTER 6 


DETERMINING THE WAY ONE VARIABLE CHANGES WHEN 
ANOTHER CHANGES: (3) FOR CURVILINEAR FUNCTIONS 

A straight-line equation is frequently a fairly good empirical state- 
ment of the relation between two variables even when the true rela- 
tion is more complex than the straight line can portray. Yet it may 
be just as important to know the exact or approximate nature of the 
relationship as it is to have an empirical statement of it. For that 
reason it is necessary to consider other ways of expressing a relation- 
ship than the straight line. 

In the automobile-stopping case we have been using as example, 
Figures 4 and 10 showed that the straight line agreed fairly well 
with the averages from the observations. Closer examination of the 
figures, however, reveals that for speeds below 10 miles per hour the 
actual stopping distance was usually greater than is indicated by the 
line; for speeds 10 to about 17 miles per hour the average stopping 
distance was about the same as indicated by the line; above 20 miles 
per hour the stopping distance was frequently much greater than 
is indicated by the straight line. These considerations rob the line 
of much of its usefulness for the purpose for which the study was 
started — to serve as a basis for establishing speed limits. The linear 
relation between speed and stopping distance is apparently not accurate 
above 20 miles per hour, tending to underestimate the distance required 
at higher speeds. Since that might be the very range within which 
it was desired to set the speed, the conclusions most needed for that 
particular purpose would be lacking. 
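The pattern just described can be looked at through the residuals from the straight line. The sketch below averages the residuals (actual minus estimated distance) within three bands of speed; the band limits are chosen here only for illustration, and the names are assumptions.

```python
# Speed in miles per hour (X) and distance-to-stop in feet (Y), 50 observations.
SPEED = [4, 7, 17, 14, 12, 11, 20, 15, 17, 13, 15, 19, 10, 18, 22, 18, 8,
         4, 12, 20, 23, 18, 12, 16, 18, 19, 24, 14, 12, 9, 10, 15, 24, 25,
         20, 19, 13, 10, 7, 16, 14, 20, 24, 24, 17, 13, 11, 13, 14, 20]
DIST = [2, 4, 50, 36, 20, 28, 48, 54, 40, 34, 26, 68, 26, 56, 66, 84, 16,
        10, 14, 56, 54, 76, 24, 32, 42, 46, 93, 26, 28, 10, 34, 20, 70, 85,
        64, 36, 26, 18, 22, 40, 60, 52, 120, 92, 32, 34, 17, 46, 80, 32]

A, B = -17.54, 3.93  # constants of the fitted straight line, as rounded in the text


def mean_residual(low, high):
    """Average of (actual - estimated) distance for speeds from low to high mph."""
    zs = [y - (A + B * x) for x, y in zip(SPEED, DIST) if low <= x <= high]
    return sum(zs) / len(zs)


below_10 = mean_residual(0, 9)
from_10_to_17 = mean_residual(10, 17)
above_20 = mean_residual(21, 25)

print("below 10 mph:", round(below_10, 1))
print("10 to 17 mph:", round(from_10_to_17, 1))
print("above 20 mph:", round(above_20, 1))
```

A positive average means that the cars in that band took, on the average, longer to stop than the straight line indicates.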

The real difficulty involved is in the assumption that the straight- 
line function applies. We have assumed that an increase of one mile 
in the speed of the car increases the distance required to stop by 
the same number of feet, no matter how fast the car is already travel- 
ing. When we examine Figures 5 and 10 closely, we see that this is 
not correct; the line of averages slants up slowly at first, then tends 
to rise more steeply as the speed is increased, until it has the steepest 
slope at the highest speed. It is therefore incorrect to assume that 



76 


SIMPLE CURVILINEAR REGRESSION 


we can express the relation by determining the average increase in 
stopping distance for an increase of one mile in the rate of speed; 
for the increase in stopping distance is not the same regardless of 
the rate of speed, but tends to become greater as the rate of speed
increases. Only if our expression of the relation can express that fact 
too will it sum up all our observations with sufficient accuracy. 

What is needed is some general way of stating the relation between 
speed and distance, similar to the general relation expressed in the 
straight-line formula, yet expressing a changing relationship instead 
of the uniform linear relation shown by the straight line. 

Different types of equations. In the same way that it is pos- 
sible to represent relations mathematically by a straight line, it is 
possible to represent them by curves of various types. We have seen 
how the equation Y = a + bX can be used to represent any straight
line by determining the proper values to be assigned to the constants 
a and b. There is practically no limit to the different kinds of curves 
which can be similarly described by mathematical equations. The 
equations of a number of curves which are useful in statistical analysis 
of the relations between variables are: 


Y = a + bX + cX²                (a)

log Y = a + bX                  (b)

log Y = a + b log X             (c)

Y = a + b log X                 (d)

1/Y = a + bX                    (e)

Y = a + bX + cX² + dX³          (f)

F = a + bX + c 

(f7) 

Each of these equations can be used to represent a certain type 
of curve. Thus type (a) is the equation of a parabola. If we take 
certain values for the unknown constants a, b, and c, substitute them 
in the formula, work out the values of Y for various values of X, and
plot them the same as we did before, we will see the sort of curve
this equation can be used to express. Thus if we take 1 for a, 0.5
for b, and -0.1 for c, the equation will read:

Y = 1 + 0.5X - 0.1X²





When the value of X is 0, Y will be 1, obviously. When X is 1, Y
will be

Y = 1 + 0.5 (1) - 0.1 (1²)
  = 1.4

When X is 2, Y will be

Y = 1 + 0.5 (2) - 0.1 (2²)
  = 1 + 1 - 0.4
  = 1.6

Similarly, when X is 3

Y = 1 + 0.5 (3) - 0.1 (3²)
  = 1 + 1.5 - 0.9
  = 1.6

For X equal to 4

Y = 1 + 0.5 (4) - 0.1 (4²)
  = 1.4

and for X = 5

Y = 1 + 0.5 (5) - 0.1 (5²)
  = 1

and for X = 6

Y = 1 + 0.5 (6) - 0.1 (6²)
  = 0.4


Plotting each of these values on cross-section paper and drawing
a smooth curve through the several points, we get the result shown in
Figure 11 in the center of the top section. Examination of the figures
above and of this chart discloses one characteristic of this type of
curve: the curve is always symmetrical on both sides of the highest
point, the point where it stops going up and starts to turn down
(as half way between X = 2 and X = 3 in this case). The value
of Y when X = 2 is the same as when X = 3. When X = 1 it is the
same as when X = 4 and, for X = 5, Y is the same as when X = 0.
As a result the curve could be cut into halves at the point of turning
downward, one of which would be the reverse of the other. Besides
this characteristic symmetry, this curve has another peculiarity:
it has one, and only one, change from moving upward to moving down-
ward, no matter what values are assigned to a, b, and c, or how far
it is carried out. For the equation shown, the curve reaches its highest
point when X = 2.5. As shown in Figure 11, the curve continues
downward on both sides of this point, no matter how large the positive
or negative values of X become. Thus if X = 100,

Y = 1 + 0.5 (100) - 0.1 (100²)
  = 1 + 50 - 1000
  = -949

If X = -100,

Y = 1 + 0.5 (-100) - 0.1 [(-100)²]
  = 1 - 50 - 1000
  = -1049

If the value of b were negative and of c were positive, the curve 
would then be concave from above instead of convex and would be 
symmetrical with respect to its lowest point. 
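The values worked out above for the parabola, its symmetry about the turning point, and its behavior far from that point can all be tabulated in a few lines. The function name `parabola` and the other names are assumptions; the turning point is found as the value of X at which the slope b + 2cX is zero.

```python
def parabola(x, a=1.0, b=0.5, c=-0.1):
    """Y = a + bX + cX^2 with the constants used in the example."""
    return a + b * x + c * x ** 2


values = {x: round(parabola(x), 2) for x in range(7)}  # X = 0, 1, ..., 6
turning_point = -0.5 / (2 * -0.1)                      # X where the slope is zero

print(values)
print("turning point at X =", round(turning_point, 2))
print("Y at X = 100 and X = -100:", round(parabola(100), 2), round(parabola(-100), 2))

# Points equally far on either side of the turning point give the same Y.
for offset in (0.5, 1.5, 2.5):
    left = parabola(turning_point - offset)
    right = parabola(turning_point + offset)
    print(offset, round(left, 2), round(right, 2))
```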

Because of the characteristics mentioned, this type of curve is not 
very satisfactory to represent many types of relations. It does have 
great flexibility, in that many differently shaped curves can be repre- 
sented by some particular segment of the parabola; but on the other 
hand the parabolic shape itself is so simple that many times the real 
relation between the variables cannot be represented by a parabola. 

The characteristics of a number of other types of simple curves 
are also illustrated in Figure 11. In each case an equation of the 
type indicated has been assumed, and the values of Y corresponding 
to values of X have been computed as has just been done for the 
simple parabola. Then plotting these computed values gives the 
curves shown. Thus type (f), the cubic parabola, is seen to have one
maximum point and one minimum point and one point of inflection 
(the point where the curve changes from concave from above to con- 
vex, or vice versa). No matter what values are assigned the constants 
in this equation, it can have only the single inflection and the two 
points of maxima and minima. Of course the particular data to be 
represented might fall anywhere along the entire course of the curve 
— if only a single change from positive to negative slope were required, 
the point of inflection in the cubic parabola might lie beyond the 
extremes of the data, and so not show at all when the fitted curve 
was plotted for the range covered by the data. 

Figure 11 also illustrates curves of types (b) to (e), as well as 
some others not given special type designations. In each case where 




the log of Y is used in place of Y, it is evident that the previous 
curve has been modified as if by compressing the ordinates nearest 
zero and stretching out the ordinates farthest away from zero, stretching 
them more and more as they depart more and more from zero. 
This process transforms the straight line of Y = a + bX to a curve 
concave from above when log Y = a + bX is used instead; or, when 
log Y = a + bX + cX^2 is substituted for Y = a + bX + cX^2, it lengthens 
out the top of the bend if b is positive, or flattens out the bottom of the 
dip if b is negative. Similar results are found with the cubic parabola. 

Similarly, when log X is used in place of X, the previous curves 
are modified as if the abscissas were compressed near zero, and 
stretched out in the higher values. This changes the straight line 




of Y = a + bX to a curve for Y = a + b log X, convex from above 
when b is positive and concave from above when b is negative. The 
parabolas are similarly transformed, making the slopes different on 
each side of the bend in the simple parabola or on each side of the 
inflection in the cubic. The effect is to move the "hump" or "dip" 
in nearer to the zero abscissa and to stretch out the remainder of 
the curve (including the second bend, in the case of the cubic 
parabola). 

When logarithms are used for both X and Y, the effect is to modify 
both sets of coordinates in the manner previously described. The 
curve log Y = a + b (log X) may have either a concave or convex 
bend if b is positive, but is always concave from above if b is negative. 
Similar modifications are noted in the case of the simple parabola. 

In any event it should be noted that the curves whose equations 
contain logarithms retain some of the same characteristics as those 
with similar equations without logarithms. Thus the linear equa- 
tions (with only a and b) never change from a positive to a negative 
slope; the simple parabola always has one such change, if carried out 
far enough; and the cubic parabola always has two such changes. 
In addition, it should be noted that a variable can be stated in terms 
of logarithms only if it has no negative values. Whereas the other 
functions can express negative values as readily as positive ones, the 
logarithmic curves always become asymptotic as they approach zero — 
that is, they tend to flatten out and to run almost parallel with the axis. 
This is because a logarithm cannot be obtained for a negative number. 
No matter how small a logarithm becomes, the corresponding anti- 
logarithm is still positive, even if only a very small decimal fraction. 

The hyperbola (type [e]) shown just below the center of Figure 
11 also is peculiar in that it can become asymptotic as it approaches 
both the X axis and the Y axis, even if one or both of the variables 
are in negative values.¹ However, the values of X and Y which it 
approaches are not the zero values, as with the logarithmic curves, but 
special values which vary in each particular case and depend upon 
the value of the constants a and b in the equation. Still more complex 
curves of the same hyperbolic type may be obtained by including 
higher powers of X, such as 

Y = 1 / (a + bX + cX^2) 

¹ There are three types of simple hyperbolas which are frequently useful in 
curve fitting: 

Y = a + b(1/X) is an equilateral hyperbola, asymptotic to a line parallel to 
the X axis; 

1/Y = a + bX is an equilateral hyperbola asymptotic to a line parallel to the 
Y axis; 

1/Y = a + b(1/X) is an equilateral hyperbola asymptotic to lines parallel to 
both axes. 

Still other curves may be represented by hybrid equations, which 
combine two or more of the simple types described thus far. Thus 
type (g) is a compound of a simple linear equation and a simple 
hyperbola. This is sometimes useful to represent curves which cannot 
be represented by the simpler types. The choice of an equation to 
represent a particular set of data, however, depends upon logical 
analysis as well as upon the empirical ability of a given equation to 
represent the relation found. This matter is discussed at length sub- 
sequently on pages 113 to 125. 

The equations discussed to this point all have one characteristic in 
common. They can all be fitted to the data by relatively elementary 
arithmetic operations, as will be shown subsequently. There are many 
other types of more complicated equations which cannot be fitted so 
readily. These can reproduce curves with recurrent or periodic oscilla- 
tions, growth curves, and other complicated biological or physical 
phenomena. Discussion of the use and fitting of such complicated 
curves lies outside the scope of this book.² 

² For examples of such complicated curves and methods of fitting them, see 
Frederick E. Croxton and Dudley J. Cowden, Applied General Statistics, pp. 540-671, 
441-462, New York, Henry Holt and Co., 1940. 

The inability of any one equation to represent many simple curves 
may be illustrated by taking a different example from the automobile-stopping 
case we have been considering previously. Table 17 shows a 
series of observations of two variables: the protein content of different 
samples of wheat, as determined by chemical analysis, and the 
proportion of "hard, dark, vitreous kernels" in each sample, as determined 
by visual examination with the naked eye. The relation here 
is quite different from the one we have been considering so far. There 
is no causal connection between these two variables in the sense of 
one's being caused by the other. Instead, they are merely two different 
ways of measuring the character of the wheat. It is a short, 
rapid process, however, to examine the samples by eye and determine 


the percentage of hard, dark, vitreous kernels, whereas it is a long and 
expensive process to run a chemical test on each lot. For that reason 
it is of importance to know whether it is possible to estimate the 
protein content from the percentage of vitreous kernels, and, if so, 
how closely. So even though the vitreous kernels do not cause the 
differences in protein, we can still regard the proportion of vitreous 
kernels as the independent variable and the percentage of protein 
as the dependent variable. That means only that we are going to try 
to estimate the dependent (protein) from the independent (percentage 
of vitreous kernels) even though there is no direct cause-and-effect 
relation present. 

TABLE 17 

Protein Content and Proportion of Vitreous Kernels for Each of a Number 
of Samples of Wheat* 

Sample number     Protein content     Proportion of vitreous kernels 
                     Per cent                   Per cent 
      1                10.3                         6 
      2                12.2                        75 
      3                14.5                        87 
      4                11.1                        55 
      5                10.9                        34 
      6                18.1                        98 
      7                14.0                        91 
      8                10.8                        45 
      9                11.4                        51 
     10                11.0                        17 
     11                10.2                        36 
     12                17.0                        97 
     13                13.8                        74 
     14                10.1                        24 
     15                14.4                        85 
     16                15.8                        96 
     17                15.6                        92 
     18                15.0                        94 
     19                13.3                        84 
     20                19.0                        99 

* These values are actual items, picked so as to show the relationship more clearly. Actually, 
the correlation is not so high as is shown by these selected cases. 

The relation between the proportion of vitreous kernels and the 
per cent of protein may be seen more readily if a dot chart is made, 
showing the two variables for each of these individual observations. 
According to the previous discussion, we shall regard the proportion 
of kernels vitreous as X, the independent variable; and the percentage 
of protein as the dependent variable, Y. In preparing the dot chart, 
shown in Figure 12, we shall therefore plot the X values, or percentage 
of vitreous kernels, along the horizontal axis and the Y values, the 
proportion of protein, along the vertical axis. 

Fig. 12. Dot chart showing relation of proportion of vitreous kernels to 
protein content of wheat. (Horizontal axis: X, vitreous kernels, in per cent.) 

It is quite obvious from an inspection of the figure that a straight 
line would not do to represent the change in protein with change in 
vitreous kernels. Some type of curve is necessary. Let us see if the 
simple parabola is the proper type of curve. 

“Fitting” a simple parabola. To represent the relationship between 
the two variables according to the formula 

Y = a + bX + cX^2          (12) 

we shall have to determine from the 20 observations the values to assign 
to the constants a, b, and c, just as before for the straight line we 
had to determine values for a and b. (Of course the a and b for 
the parabola will not be the same as the values for the straight 
line, unless c happens to be zero, which would make the equation 
for the parabola give a straight line instead.) The values for these 
constants are determined by constructing and solving the following 
equations:³ 

(Σx^2)b + (Σxu)c = Σxy 
(Σxu)b + (Σu^2)c = Σuy          (13) 

and 

a = M_y - b(M_x) - c(M_u)          (14) 


The values necessary in constructing equations (13) and (14) are 
derived as follows: 

Use U to represent the X^2 values of equation (12).⁴ 

Then 

M_x = ΣX/n;   M_u = ΣU/n;   M_y = ΣY/n 

Σx^2 = ΣX^2 - nM_x^2 
Σxu = ΣXU - nM_xM_u 
Σu^2 = ΣU^2 - nM_u^2 
Σxy = ΣXY - nM_xM_y 
Σuy = ΣUY - nM_uM_y          (15) 


After computing these values, the two equations (13) are solved 
simultaneously to obtain the values for b and c, and then these 
values are substituted in equation (14) to obtain the value for a. 
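
The whole procedure of equations (13), (14), and (15) can be sketched as a short routine. This is only an illustration under stated assumptions: the function and variable names are ours, the two equations (13) are solved here by Cramer's rule rather than by the Doolittle method used in this book, and the check data are hypothetical values lying exactly on a known parabola.

```python
def fit_simple_parabola(xs, ys):
    """Fit Y = a + bX + cX^2 following equations (13), (14), and (15)."""
    n = len(xs)
    us = [x * x for x in xs]                 # U stands for X^2
    mx, mu, my = sum(xs) / n, sum(us) / n, sum(ys) / n

    def dev_sum(ps, qs, mp, mq):
        # Sum of products of deviations, e.g. Σxu = ΣXU - n*M_x*M_u
        return sum(p * q for p, q in zip(ps, qs)) - n * mp * mq

    sxx = dev_sum(xs, xs, mx, mx)
    sxu = dev_sum(xs, us, mx, mu)
    suu = dev_sum(us, us, mu, mu)
    sxy = dev_sum(xs, ys, mx, my)
    suy = dev_sum(us, ys, mu, my)

    # Solve the two equations (13) for b and c (Cramer's rule).
    det = sxx * suu - sxu * sxu
    b = (sxy * suu - sxu * suy) / det
    c = (sxx * suy - sxu * sxy) / det
    a = my - b * mx - c * mu                 # equation (14)
    return a, b, c

# Hypothetical check data lying exactly on Y = 1 + 0.5X - 0.1X^2.
check_x = [0, 1, 2, 3, 4, 5]
check_y = [1 + 0.5 * x - 0.1 * x * x for x in check_x]
a, b, c = fit_simple_parabola(check_x, check_y)
print(round(a, 6), round(b, 6), round(c, 6))
```

Because the check data lie exactly on a parabola, the routine should return the constants that generated them, apart from rounding in the last places.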

Table 18, following, shows the form of computation in the first 
step to obtain these values for the data of Table 17. 

³ An alternative method is to solve the following three equations simultaneously. 
The clerical work is about the same in both methods. 

na + (ΣX)b + (ΣU)c = ΣY 
(ΣX)a + (ΣX^2)b + (ΣUX)c = ΣXY 
(ΣU)a + (ΣUX)b + (ΣU^2)c = ΣYU 

These equations are derived by the process explained in Note 2, Appendix 2. 

⁴ If U is made equal to X^2 divided by some convenient number, say 1,000, the 
volume of necessary arithmetic can be materially reduced, without affecting the 
accuracy of the result. See Note 3, Appendix 2, for proof. 



TABLE 18 

Computation, for Wheat Problem, of Values Needed to Determine 
Constants of the Simple Parabola 

Per cent    Per cent 
vitreous    protein 
kernels     (minus 10)*   X^2 and U          XU            U^2         XY          UY 
   X           Y 
    6         0.3              36           216          1,296        1.8        10.8 
   75         2.2           5,625       421,875     31,640,625      165.0    12,375.0 
   87         4.5           7,569       658,503     57,289,761      391.5    34,060.5 
   55         1.1           3,025       166,375      9,150,625       60.5     3,327.5 
   34         0.9           1,156        39,304      1,336,336       30.6     1,040.4 
   98         8.1           9,604       941,192     92,236,816      793.8    77,792.4 
   91         4.0           8,281       753,571     68,574,961      364.0    33,124.0 
   45         0.8           2,025        91,125      4,100,625       36.0     1,620.0 
   51         1.4           2,601       132,651      6,765,201       71.4     3,641.4 
   17         1.0             289         4,913         83,521       17.0       289.0 
   36         0.2           1,296        46,656      1,679,616        7.2       259.2 
   97         7.0           9,409       912,673     88,529,281      679.0    65,863.0 
   74         3.8           5,476       405,224     29,986,576      281.2    20,808.8 
   24         0.1             576        13,824        331,776        2.4        57.6 
   85         4.4           7,225       614,125     52,200,625      374.0    31,790.0 
   96         5.8           9,216       884,736     84,934,656      556.8    53,452.8 
   92         5.6           8,464       778,688     71,639,296      515.2    47,398.4 
   94         5.0           8,836       830,584     78,074,896      470.0    44,180.0 
   84         3.3           7,056       592,704     49,787,136      277.2    23,284.8 
   99         9.0           9,801       970,299     96,059,601      891.0    88,209.0 

Sums 
1,340        68.5         107,566     9,259,238    824,403,226    5,985.6   542,584.6 

* 10 subtracted from each protein reading (see Note 3, Appendix 2). 

The values at the foot of the table give the values called for in 
equations (15). Substituting the values as computed for those shown 
symbolically, the arithmetic appears as follows: 

M_x = ΣX/n = 1,340/20 = 67 
M_y = ΣY/n = 68.5/20 = 3.425 
M_u = ΣU/n = 107,566/20 = 5,378.3 

Σx^2 = ΣX^2 - nM_x^2 = 107,566 - 20(67)^2 = 17,786 
Σxu = ΣXU - nM_xM_u = 9,259,238 - 20(67)(5,378.3) = 2,052,316 
Σu^2 = ΣU^2 - nM_u^2 = 824,403,226 - 20(5,378.3)^2 = 245,881,008 
Σxy = ΣXY - nM_xM_y = 5,985.6 - 20(67)(3.425) = 1,396.1 
Σuy = ΣUY - nM_uM_y = 542,584.6 - 20(5,378.3)(3.425) = 174,171.05 
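
The column totals and the corrected sums above can be reproduced mechanically from the data of Table 17. The sketch below is only a check on the arithmetic; the variable names are ours, and 10 has been subtracted from each protein reading as in Table 18.

```python
# Per cent vitreous kernels (X) and per cent protein minus 10 (Y), from Table 17.
X = [6, 75, 87, 55, 34, 98, 91, 45, 51, 17, 36, 97, 74, 24, 85, 96, 92, 94, 84, 99]
Y = [0.3, 2.2, 4.5, 1.1, 0.9, 8.1, 4.0, 0.8, 1.4, 1.0,
     0.2, 7.0, 3.8, 0.1, 4.4, 5.8, 5.6, 5.0, 3.3, 9.0]

n = len(X)
U = [x * x for x in X]                        # U = X^2

# Column totals of Table 18.
sum_x, sum_y, sum_u = sum(X), sum(Y), sum(U)
sum_xu = sum(x * u for x, u in zip(X, U))
sum_uu = sum(u * u for u in U)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_uy = sum(u * y for u, y in zip(U, Y))

# Means and sums corrected to deviations from the means, equations (15).
mx, my, mu = sum_x / n, sum_y / n, sum_u / n
sxx = sum_u - n * mx * mx                     # ΣX^2 is the same total as ΣU
sxu = sum_xu - n * mx * mu
suu = sum_uu - n * mu * mu
sxy = sum_xy - n * mx * my
suy = sum_uy - n * mu * my

print(sum_x, sum_u, sum_xu, sum_uu)
print(round(sxx, 2), round(sxu, 2), round(suu, 2), round(sxy, 2), round(suy, 2))
```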

These calculations give the values needed in equations (13), which 
are to be solved simultaneously to obtain the values of b and c. Sub- 
stituting the values just computed in the equations gives the two equa- 
tions to be solved as follows: 

(A) (Σx^2)b + (Σxu)c = Σxy, or 17,786b + 2,052,316c = 1,396.1 

(B) (Σxu)b + (Σu^2)c = Σuy, or 2,052,316b + 245,881,008c = 174,171.05 

The simplest way to solve these is by the Doolittle method, as indi- 
cated in Appendix I, page 464. 

Solving the equations simultaneously gives b = -0.0879, c = 
0.001442. These values are then substituted in equation (14) to obtain 
the value for a. 

a = M_y - b(M_x) - c(M_u) 
  = 3.425 - (-0.0879)(67) - (0.001442)(5,378.3) 
  = +1.56 
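
For readers who wish to verify the solution, the two equations can be solved by simple elimination. The sketch below is not the Doolittle form of computation; it merely eliminates b from equation (B), takes the sums and means as given in the text, and uses names of our own choosing.

```python
# Sums of deviations and means as given in the text.
sxx, sxu, suu = 17_786.0, 2_052_316.0, 245_881_008.0
sxy, suy = 1_396.1, 174_171.05
mx, my, mu = 67.0, 3.425, 5_378.3

# Eliminate b: subtract (sxu / sxx) times equation (A) from equation (B).
ratio = sxu / sxx
c = (suy - ratio * sxy) / (suu - ratio * sxu)

# Substitute c back in equation (A) to obtain b, then use equation (14) for a.
b = (sxy - sxu * c) / sxx
a = my - b * mx - c * mu

print(round(b, 4), round(c, 6), round(a, 2))
```

Since b and c are carried here to more places than in the text, the value of a may differ slightly in the third decimal from one computed with the rounded constants.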

With our values for a, b, and c, we can now write out the equation 
for the parabola, Y = a + bX + cX^2 (12), for this particular case as 
follows: 

Y = 1.56 - 0.088X + 0.00144X^2 

Since 10 was subtracted from the percentage of protein before calculating 
the equation,⁵ to estimate the actual percentage 10 must be added 
back in, making the equation read 

Y = 11.56 - 0.088X + 0.00144X^2          (I) 

This then is the equation of the simple parabola which comes 
nearest to describing the relationships between Y and X. From it 
the percentage of protein in a given sample of wheat may be estimated 
from the percentage of hard, dark, vitreous kernels in that sample. 

⁵ See Note 3, Appendix 2, for proof that this does not affect the values obtained 
for Σ(x^2), Σ(xy), etc. 





We can see how the estimates are made by working them out for 
some of the samples. If we take the values of X for the first five 
samples in Table 18 — 6, 75, 87, 55, and 34, for example — and substitute 
them in equation (I) above, we obtain estimated values for Y as 
follows: 

When X = 6, 
    Y = 11.56 - 0.088(6) + 0.00144(36) = 11.08 
When X = 75, 
    Y = 11.56 - 0.088(75) + 0.00144(5,625) = 13.06 
When X = 87, 
    Y = 11.56 - 0.088(87) + 0.00144(7,569) = 14.80 
When X = 55, 
    Y = 11.56 - 0.088(55) + 0.00144(3,025) = 11.08 
When X = 34, 
    Y = 11.56 - 0.088(34) + 0.00144(1,156) = 10.23 

Substituting each of the values of X in the formula in turn in 
a similar manner, we obtain estimated values for Y as shown in Table 
19. So as to distinguish between the actual values of Y, and the 
values for Y estimated from X according to the equation of the 
parabola, we shall designate the latter as Y' values. 

It is quite apparent from the table that the actual and the estimated 
values generally fall rather near each other, the estimates part 
of the time being too high and part of the time too low. We can 
get a better idea of the relation between the estimated and actual 
values by plotting both on a dot chart (Figure 13), similar to the way 
we did in Figure 12, using dots as before to represent the values of Y 
originally observed and crosses to represent the estimated values, Y'. 
Since the Y' values are all computed from the formula, the crosses all 
lie on a continuous smooth curve, which we can sketch in freehand, as 
indicated by the dotted line in the figure. Now if we want to estimate 
the protein for a sample with a proportion of vitreous kernels not 
included in our problem, say 65 for example, we can determine it 
either by substituting 65 for X in equation (I), and computing it out, 
or by reading from our smooth curve the Y value corresponding to 
an X value of 65. Of course this graphic interpolation, as it is called, 
will not be quite so exact as will the actual computation, but for many 
purposes the result will be sufficiently accurate. 
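
The substitution in equation (I) is easily mechanized. The sketch below, with a function name of our own choosing, reproduces the estimates for the first five samples, shows the difference between actual and estimated protein, and then computes the estimate for the 65 per cent case mentioned above.

```python
def estimate_protein(x):
    """Equation (I): Y' = 11.56 - 0.088X + 0.00144X^2."""
    return 11.56 - 0.088 * x + 0.00144 * x ** 2

# First five samples of Table 17: per cent vitreous kernels -> per cent protein.
first_five = {6: 10.3, 75: 12.2, 87: 14.5, 55: 11.1, 34: 10.9}

for x, actual in first_five.items():
    estimated = estimate_protein(x)
    # Difference between actual and estimated protein, Y - Y'
    print(x, round(estimated, 2), round(actual - estimated, 2))

# A proportion of vitreous kernels not included among the observations.
print(round(estimate_protein(65), 2))
```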





Let us now examine Figure 13 and decide whether the formula 
for the parabola gives a satisfactory “fit” in this case, that is, whether 
the estimated values do agree fairly well with the actual. We see at 
once that the curved line of the estimates does come closer to agreeing 
with the actual values than any straight line could. But on the other 
hand we see that the general shape of the parabolic curve and the 
general trend of the actual relationship is rather different. For low 
proportions of vitreous kernels, the estimated values are generally 
too low; for the highest proportions, they are also generally too low; 
whereas for proportions of vitreous kernels ranging from 70 to 95 
per cent, the estimates are too high. 

TABLE 19 

Comparison, for Wheat Problem, of Actual Protein Content with Protein 
Content Estimated from Per Cent of Vitreous Kernels on Basis of 
the Simple Parabola 

Per cent vitreous   Per cent protein   Estimated per cent     Difference between actual 
kernels, X          (minus 10), Y      protein (minus 10),    and estimated protein, 
                                       Y'                     (Y - Y') 
       6                 0.3                 1.08                  -0.78 
      75                 2.2                 3.06                  -0.86 
      87                 4.5                 4.80                  -0.30 
      55                 1.1                 1.08                  +0.02 
      34                 0.9                 0.23                  +0.67 
      98                 8.1                 6.79                  +1.31 
      91                 4.0                 5.50                  -1.50 
      45                 0.8                 0.52                  +0.28 
      51                 1.4                 0.83                  +0.57 
      17                 1.0                 0.48                  +0.52 
      36                 0.2                 0.26                  -0.06 
      97                 7.0                 6.60                  +0.40 
      74                 3.8                 2.95                  +0.85 
      24                 0.1                 0.28                  -0.18 
      85                 4.4                 4.51                  -0.11 
      96                 5.8                 6.41                  -0.61 
      92                 5.6                 5.68                  -0.08 
      94                 5.0                 6.04                  -1.04 
      84                 3.3                 4.35                  -1.05 
      99                 9.0                 6.99                  +2.01 




Fig. 13. Dot chart showing relation of vitreous kernels to protein content of 
wheat, and parabolic curve fitted to same. (Horizontal axis: X, vitreous 
kernels, in per cent; vertical axis: protein content.) 

Apparently the equation of the simple parabola is not adequate to 
describe this particular relationship. Especially for high proportions 
of vitreous kernels, the estimates are quite inaccurate. For 99 per 
cent vitreous, the parabola would estimate 17.0 per cent protein, 
whereas both samples over 97 per cent vitreous kernels had over 18 
per cent protein. The failure of this curve to give a satisfactory 
“fit” is not due to any error in the computations but merely to the 
fact that this formula cannot give the proper-shaped curve to fit the 
relationship in this case. The mathematical properties of the equation 
itself are such that, no matter what constants are used for a, b, 
and c, it cannot come any closer to describing the true relation. The 
method just used in computing a, b, and c gives the best values for 
this case; any other three values substituted in the same formula 
would do even less well in “fitting” this particular set of observations. 
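
The statement that any other three values would fit less well may be illustrated numerically, taking "best" to mean the smallest sum of squared differences between actual and estimated values. The sketch below is an illustration only: the names are ours, the alternative sets of constants are hypothetical values chosen for comparison, and the normal equations are solved by Cramer's rule rather than by the Doolittle method.

```python
X = [6, 75, 87, 55, 34, 98, 91, 45, 51, 17, 36, 97, 74, 24, 85, 96, 92, 94, 84, 99]
Y = [10.3, 12.2, 14.5, 11.1, 10.9, 18.1, 14.0, 10.8, 11.4, 11.0,
     10.2, 17.0, 13.8, 10.1, 14.4, 15.8, 15.6, 15.0, 13.3, 19.0]

def fit_parabola(xs, ys):
    """Least-squares constants of Y = a + bX + cX^2 from sums of deviations."""
    n = len(xs)
    us = [x * x for x in xs]
    mx, mu, my = sum(xs) / n, sum(us) / n, sum(ys) / n
    sxx = sum(x * x for x in xs) - n * mx * mx
    sxu = sum(x * u for x, u in zip(xs, us)) - n * mx * mu
    suu = sum(u * u for u in us) - n * mu * mu
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * mx * my
    suy = sum(u * y for u, y in zip(us, ys)) - n * mu * my
    det = sxx * suu - sxu * sxu
    b = (sxy * suu - sxu * suy) / det
    c = (sxx * suy - sxu * sxy) / det
    return my - b * mx - c * mu, b, c

def sum_sq_residuals(a, b, c):
    """Sum of squared differences between actual and estimated protein."""
    return sum((y - (a + b * x + c * x * x)) ** 2 for x, y in zip(X, Y))

a, b, c = fit_parabola(X, Y)
best = sum_sq_residuals(a, b, c)

# Hypothetical alternative constants, chosen only for comparison.
alternatives = [(a + 0.5, b, c), (a, b + 0.01, c), (a, b, c + 0.0001),
                (11.0, -0.05, 0.001)]
worse = [sum_sq_residuals(*alt) for alt in alternatives]
print(round(best, 2), [round(w, 2) for w in worse])
```

Every alternative should give a larger sum of squared differences than the fitted constants, whatever alternatives are tried.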

“Fitting” a cubic parabola. The cubic parabola, type (f) of 
the equations on page 76, might be tried to see if it would describe 
this particular relationship more closely. 

The equation of the cubic parabola, 

Y = a + bX + cX^2 + dX^3          (16) 

has four constants a, b, c, and d to be computed. Here again, of 
course, a, b, and c will be different from those we have computed 
previously, unless the d value comes out zero. The values b, c, and 
d are computed by the simultaneous solution of the following three 
equations:⁶ 

Use U to represent the X^2 of equation (16) and V to represent 
the X^3. 

(Σx^2)b + (Σxu)c + (Σxv)d = Σxy 
(Σxu)b + (Σu^2)c + (Σuv)d = Σuy 
(Σxv)b + (Σuv)c + (Σv^2)d = Σvy          (17) 


The value for a is then computed from the following equation: 

a = M_y - b(M_x) - c(M_u) - d(M_v)          (18) 

The values for Σx^2, Σxu, Σxy, Σu^2, and Σuy are computed 
as shown previously, equations (15). The additional values 
required in equation (17) are computed as follows: 

M_v = ΣV/n 
Σuv = ΣUV - nM_uM_v 
Σxv = ΣXV - nM_xM_v 
Σv^2 = ΣV^2 - nM_v^2 
Σvy = ΣVY - nM_vM_y          (19) 


It should be noted that among the values required to “fit” this 
cubic parabola, that is, to determine the constants a, b, c, and d, are 
such values as ΣV^2 and ΣUV. Remembering that V = X^3 and 
U = X^2, we need to calculate X^5 and X^6. For X = 10, X^6 = 1,000,000, 
so for values of X such as those in Table 17, ranging from 6 to 99, it 
would take a tremendous volume of computation to compute the 
values required in equations (17), (18), and (19). This may be 
reduced by letting U = X^2/100, and V = X^3/10,000. The computation 
is not shown here in detail. It follows the general form of that 
given in Table 18; and the solution of the equations (17), starting in 
just as shown on page 200, may be most conveniently carried through 
by the method shown subsequently on page 464. 

⁶ The alternative method here involves the simultaneous solution of 4 equations, 
as follows: 

na + (ΣX)b + (ΣU)c + (ΣV)d = ΣY 
(ΣX)a + (ΣX^2)b + (ΣXU)c + (ΣXV)d = ΣXY 
(ΣU)a + (ΣUX)b + (ΣU^2)c + (ΣUV)d = ΣUY 
(ΣV)a + (ΣVX)b + (ΣUV)c + (ΣV^2)d = ΣVY 

Even when the cubic parabola is “fitted” to the data given, however, 
it does not give a satisfactory “fit.” Thus Figure 14 shows 
the cubic parabola fitted to the data, worked out as just described. 
The values found gave the equation 

Y = 0.35 + 0.0345X - 0.1397(X^2/100) + 0.1788(X^3/10,000) 

or, clearing of fractions,⁷ 

Y = 0.35 + 0.0345X - 0.0014X^2 + 0.000018X^3 

Adding in the 10 which was subtracted from Y before making the computations, 
the equation becomes 

Y = 10.35 + 0.0345X - 0.0014X^2 + 0.000018X^3 
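
The step of clearing of fractions amounts to dividing each constant by the number used in coding its variable. A brief sketch, with names of our own choosing, shows that the coded and the cleared forms give the same estimate so long as the cleared constants are not rounded.

```python
def cubic_coded(x):
    """Equation as fitted, with U = X^2/100 and V = X^3/10,000."""
    return 0.35 + 0.0345 * x - 0.1397 * (x ** 2 / 100) + 0.1788 * (x ** 3 / 10_000)

# Clearing of fractions: divide each constant by the divisor used in coding.
c_cleared = -0.1397 / 100          # constant for X^2
d_cleared = 0.1788 / 10_000        # constant for X^3

def cubic_cleared(x):
    """The same equation stated in terms of X^2 and X^3 directly."""
    return 0.35 + 0.0345 * x + c_cleared * x ** 2 + d_cleared * x ** 3

print(round(c_cleared, 4), round(d_cleared, 6))
print(cubic_coded(99), cubic_cleared(99))
```

Rounding the cleared constants to -0.0014 and 0.000018, as in the text, changes the estimates slightly at high values of X, since X^3 is then very large.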

In Figure 14, the original observations are represented by dots, the 
estimated values from the cubic parabola are represented by stars, 
and the curve of the simple parabola is also shown. A curve has been 
drawn through the stars to show the general shape of the cubic 
parabola. 

The last curve comes much closer than the previous curve to 
describing the relationship which actually exists. Even so, however, 
it is not entirely satisfactory, for it gives estimates which are still too 
low at the very highest percentage of vitreous kernels. Except for 
this portion, and the downturn at the beginning, it seems quite 
satisfactory. 

There are still other types of curves, however, some of which might 
give better fits than the ones we have tried. For instance the fourth-order 
parabola, 

Y = a + bX + cX^2 + dX^3 + eX^4 

can be fitted by an extension of the methods just described, as can 
parabolas with even more terms. Those are rarely useful, however, as 
the greater the number of terms, the greater the tendency becomes for 
the curve to “wiggle.” In addition, the volume of arithmetic required 
becomes extremely burdensome, the computations for the fourth-order 
parabolas involving powers of X up to X^8. 

⁷ See Note 3, Appendix 2, for proof of this step. 


Fig. 14. Dot chart, with parabola and cubic parabola. (Vertical axis: 
protein content, in per cent.) 

Furthermore, there are only a limited number of observations, 
20 in all. If a parabola were fitted with 20 constants, for example, 
it would simply twist and turn so as to pass through every observation. 
Since it would simply reproduce these 20 observations, it would 
be of no value at all in indicating the relation which probably holds 
true in the universe from which the observations in the sample are 
drawn. (See Chapters 18 and 22 for further discussion and mathematical 
measures of this question of the sampling significance of a 
fitted curve.) 
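
That a curve with as many constants as observations merely reproduces the observations can be seen without solving for the 20 constants. The sketch below evaluates the one polynomial of the nineteenth degree which passes through all 20 points, using Lagrange's interpolation formula; this device is brought in here only for illustration and is not one of the methods of this book.

```python
X = [6, 75, 87, 55, 34, 98, 91, 45, 51, 17, 36, 97, 74, 24, 85, 96, 92, 94, 84, 99]
Y = [10.3, 12.2, 14.5, 11.1, 10.9, 18.1, 14.0, 10.8, 11.4, 11.0,
     10.2, 17.0, 13.8, 10.1, 14.4, 15.8, 15.6, 15.0, 13.3, 19.0]

def interpolating_polynomial(x):
    """Value at x of the polynomial with 20 constants through all 20 points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(X, Y)):
        term = yi
        for j, xj in enumerate(X):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# At every observed X the curve returns the observed Y exactly.
reproduced = [interpolating_polynomial(x) for x in X]
print(all(abs(r - y) < 1e-9 for r, y in zip(reproduced, Y)))

# Between the observations the curve is free to twist and turn.
print(interpolating_polynomial(10), interpolating_polynomial(65))
```

The estimates between observations have no claim to represent the relation in the universe sampled, since the curve has used up all the information in reproducing the sample itself.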

Fitting lines or parabolas to time series. In studying time series, 
it is sometimes desirable to fit a straight line or a curve to the successive 
observations as a means of determining the long-time trend. 
The techniques of time-series analysis lie outside the scope of this book, 
and therefore are not given especial consideration here.⁸ Fitting a 
mathematical trend to a time series involves regarding the successive 
months or years as values of the X, or independent, variable. 
The fact that these values are regularly spaced, 1, 2, 3, 4, etc., and 
that the same succession reoccurs in many problems, makes possible 
special methods and special tables, which greatly reduce the labor 
of fitting the equations. This method of computation, known as 
orthogonal polynomials, should be used in determining lines or parabolic 
curves for such data.⁹ 

⁸ An excellent discussion of the methods and meaning of time-series analysis is 
given by Frederick C. Mills in his textbook, Statistical Methods, Chapters VII, 
VIII, and XI, revised edition, Henry Holt and Co., New York, 1938. See also 
Max Sasuly, Trend Analysis of Statistics, The Brookings Institution, Washington, 
1934. 

“Fitting” a logarithmic curve. Some of the other types of curves 
mentioned on page 76, particularly types b, c, and d, involving 
logarithms, and type e, using reciprocals, may be fitted with relatively 
little computation. The methods of fitting one of each of these types 
may be shown for the present case, even though they may fail to give 
any better fit than the curves which have already been computed. 

The three simple types of logarithmic curves, b, c, and d, may all 
be fitted by exactly the same method previously used in fitting a 
straight line, except that the logarithms of X, of Y, or of both together 
are employed where otherwise the values of the variables themselves 
are used. Comparison of the straight-line formula with the logarithmic 
formula indicates how this is done. 

If we use Ȳ to represent the logarithms of the Y values, and X̄ to 
represent the logarithms of the X values, our equations will change as 
follows: 

(b) log Y = a + bX, to Ȳ = a + bX 
(c) log Y = a + b log X, to Ȳ = a + bX̄ 
(d) Y = a + b log X, to Y = a + bX̄ 

In each case it is evident that the new equation is identical in form 
with the simple straight-line equation, 

Y = a + bX 


and the same methods may therefore be used in determining the con- 
stants a and b as were used earlier in equations (8) to (11). 

Some indication as to which one of the three logarithmic formulas 
will come nearest to fitting a given set of data can be obtained by converting 
both the X and Y values to logarithms, variables X̄ and Ȳ, and 
then making dot charts of Ȳ against X, of Y against X̄, and of Ȳ against 
X̄. If one chart shows the dots falling in substantially a straight line 
the equation corresponding to that chart will give the most satisfactory 
fit.¹⁰ 

⁹ For methods of fitting orthogonal polynomials, see Frederick E. Croxton and 
Dudley J. Cowden, Applied General Statistics, pp. 433-35, Prentice-Hall, Inc., New 
York, 1940, and R. A. Fisher, Statistical Methods for Research Workers, seventh 
edition, Oliver and Boyd, Edinburgh and London, 1938, pp. 148-155. 

The first step in applying any one of the three logarithmic equations 
to the data of the wheat example is to work out the logarithms 
and construct the three dot charts, to indicate which formula to use. 
The form of computation is shown in Table 20. 

TABLE 20 

Variables in Wheat Problem and Logarithms of Values 

Per cent protein   Per cent vitreous        Logarithms of variables:* 
       Y               kernels           Protein      Vitreous kernels 
                          X                 Ȳ                X̄ 
     10.3                 6               1.013            0.778 
     12.2                75               1.086            1.875 
     14.5                87               1.161            1.940 
     11.1                55               1.045            1.740 
     10.9                34               1.037            1.531 
     18.1                98               1.258            1.991 
     14.0                91               1.146            1.959 
     10.8                45               1.033            1.653 
     11.4                51               1.057            1.708 
     11.0                17               1.041            1.230 
     10.2                36               1.009            1.556 
     17.0                97               1.230            1.987 
     13.8                74               1.140            1.869 
     10.1                24               1.004            1.380 
     14.4                85               1.158            1.929 
     15.8                96               1.199            1.982 
     15.6                92               1.193            1.964 
     15.0                94               1.176            1.973 
     13.3                84               1.124            1.924 
     19.0                99               1.279            1.996 

* Logarithms to base 10. 

¹⁰ This is strictly true only if the “goodness of fit” is measured in terms of the 
logarithms used. 

¹¹ Logarithms may also be used with parabolas of higher orders, such as: 

log Y = a + bX + cX^2 

Such involved curves will not be considered at length in this book, however. 


Fig. 15. Dot charts illustrating log Y = f(X); Y = f(log X); log Y = f(log X). 

It should be noted that in working out the logarithms nothing can 
be added or subtracted from any of the variables (except for rounding 
off decimals).¹² In all the previous work the protein had been 
stated as protein in excess of 10 per cent, but now the original percentage 
figures are used once more. That is because logarithms deal 
with relative values, and the relation of 1 to 2 is quite different from the 
relation of 11 to 12. All the previous equations have dealt with absolute 
values or differences from the average; and the absolute difference 
between 1 and 2 is of course just the same as that between 11 and 12. 
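
The distinction between absolute and relative differences may be made concrete with the two pairs of numbers just mentioned. The sketch uses logarithms to base 10; the variable names are ours.

```python
import math

# Absolute differences: adding 10 to both numbers leaves the difference unchanged.
abs_small = 2 - 1
abs_large = 12 - 11

# Differences of logarithms measure the ratio, and the two ratios are far apart.
log_small = math.log10(2) - math.log10(1)      # log of the ratio 2/1
log_large = math.log10(12) - math.log10(11)    # log of the ratio 12/11

print(abs_small, abs_large)
print(round(log_small, 5), round(log_large, 5))
```

This is why protein in excess of 10 per cent, which served well enough in the earlier equations, cannot be used when the logarithms of protein are to be taken.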

Figure 15 gives the three dot charts in which the three different 
ways of combining the logarithmic and actual values are shown. 
None of the three gives a very close linear relation, but the one where 
Ȳ and X are plotted seems to come nearest. The equation 

log Y = a + bX, or Ȳ = a + bX 

will therefore be used. 

¹² After the logarithms are once computed, however, they can be “coded” by 
subtracting a constant or by division, just as other variables have been treated 
formerly, with the same effect on the final constants obtained. 


The values necessary to determine a and b are as follows, using 
equations (9) and (10): 

ΣXȲ, M_ȳ, ΣX^2 

Table 21 shows in full the computation of these values from the 
original values of the two variables. 

TABLE 21 

Computation, for Wheat Problem, of Values Needed to Determine 
Constants for Logarithmic Curve 

Per cent    Per cent vitreous   Logarithm of Y        Extensions 
protein         kernels 
   Y               X                  Ȳ            X^2          XȲ 
 10.3              6                1.013            36        6.078 
 12.2             75                1.086         5,625       81.450 
 14.5             87                1.161         7,569      101.007 
 11.1             55                1.045         3,025       57.475 
 10.9             34                1.037         1,156       35.258 
 18.1             98                1.258         9,604      123.284 
 14.0             91                1.146         8,281      104.286 
 10.8             45                1.033         2,025       46.485 
 11.4             51                1.057         2,601       53.907 
 11.0             17                1.041           289       17.697 
 10.2             36                1.009         1,296       36.324 
 17.0             97                1.230         9,409      119.310 
 13.8             74                1.140         5,476       84.360 
 10.1             24                1.004           576       24.096 
 14.4             85                1.158         7,225       98.430 
 15.8             96                1.199         9,216      115.104 
 15.6             92                1.193         8,464      109.756 
 15.0             94                1.176         8,836      110.544 
 13.3             84                1.124         7,056       94.416 
 19.0             99                1.279         9,801      126.621 

ΣX^2 = 107,566          ΣXȲ = 1,545.888 


This computation gives the values necessary to compute a and b 
by formulas (9) and (10). 





“PITTING” A LOGARITHMIC CURVE 


97 


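The averages of X and $\bar{Y}$ (the logarithms of Y) of course are:

$$M_x = \frac{\Sigma X}{n} = \frac{1{,}340}{20} = 67, \qquad M_{\bar{y}} = \frac{\Sigma \bar{Y}}{n} = \frac{22.389}{20} = 1.11945$$

Then

$$b = \frac{\Sigma X\bar{Y} - nM_xM_{\bar{y}}}{\Sigma X^2 - nM_x^2} = \frac{1{,}545.888 - 20(67)(1.11945)}{107{,}566 - 20(67)^2} = 0.002576$$

and

$$a = M_{\bar{y}} - b(M_x) = 1.11945 - (0.002576)(67) = 0.9469$$

In terms of the variables, the equation required is therefore

$$\bar{Y} = a + bX = 0.9469 + 0.002576X$$

or

$$\log Y = a + bX = 0.9469 + 0.002576X$$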
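The arithmetic above can be retraced mechanically. The following is a minimal sketch in Python, not part of the original computation: the function name `fit_semilog` and the list `WHEAT` are illustrative, and the logarithms are taken at full machine precision rather than to the three places of Table 21, so the constants should agree with 0.9469 and 0.002576 only to about three significant figures.

```python
import math

# (X, Y) pairs for the wheat problem as listed in Table 21:
# X = per cent vitreous kernels, Y = per cent protein.
WHEAT = [
    (6, 10.3), (75, 12.2), (87, 14.5), (55, 11.1), (34, 10.9),
    (98, 18.1), (91, 14.0), (45, 10.8), (51, 11.4), (17, 11.0),
    (36, 10.2), (97, 17.0), (74, 13.8), (24, 10.1), (85, 14.4),
    (96, 15.8), (92, 15.6), (94, 15.0), (84, 13.3), (99, 19.0),
]


def fit_semilog(pairs):
    """Fit log10(Y) = a + b*X from the two means and the sums of X^2 and X*log Y."""
    n = len(pairs)
    xs = [x for x, _ in pairs]
    log_ys = [math.log10(y) for _, y in pairs]
    mean_x = sum(xs) / n
    mean_log_y = sum(log_ys) / n
    sum_x2 = sum(x * x for x in xs)
    sum_x_log_y = sum(x * ly for x, ly in zip(xs, log_ys))
    # Same form as the equations for b and a in the text.
    b = (sum_x_log_y - n * mean_x * mean_log_y) / (sum_x2 - n * mean_x ** 2)
    a = mean_log_y - b * mean_x
    return a, b


a, b = fit_semilog(WHEAT)
print(f"log Y = {a:.4f} + {b:.6f} X")
```

Small differences from the printed constants are to be expected, since they arise only from the rounding of the logarithms.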


The percentage of protein can now be estimated from the proportion
of vitreous kernels observed for any sample of wheat, by substituting
the percentage of vitreous kernels (the X values) in this equation
and working it out. Thus for the first example, with 6 per cent
of vitreous kernels, it would work out as follows:

$$\log Y = a + bX = 0.9469 + 0.002576(6)$$

$$\log Y = 0.9624$$

Using a table of logarithms we find that the number corresponding to
the logarithm 0.9624 (that is to say, its antilogarithm) is 9.17. The
estimated proportion of protein is therefore 9.17 per cent.

Similarly if the proportion of vitreous kernels in the second sample,
75, is substituted in the equation, the work to calculate the estimated
proportion of protein is:

$$\log Y = a + bX = 0.9469 + 0.002576(75)$$

$$\log Y = 1.1401$$

$$\text{antilog } 1.1401 = 13.81$$

The estimated proportion of protein is therefore 13.81 per cent.
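The two worked estimates can be reproduced without a table of logarithms. This is a small illustrative sketch in Python; the function name `estimate_protein` is hypothetical, and the constants are simply those printed in the text.

```python
# Constants of the fitted equation log Y = a + bX, as printed in the text.
A = 0.9469
B = 0.002576


def estimate_protein(x, a=A, b=B):
    """Return (estimated log10 of Y, estimated Y) for a given X."""
    log_y = a + b * x
    # Raising 10 to the estimated logarithm gives its antilogarithm.
    return log_y, 10 ** log_y


for x in (6, 75):
    log_y, y = estimate_protein(x)
    print(x, round(log_y, 4), round(y, 2))
```

Taking the antilogarithm is the only step that differs from estimating with an ordinary straight line.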





Table 22 shows this computation carried through for each of the 
20 observations. 


TABLE 22

Computation, for Wheat Problem, of Estimated Protein Content from
Per Cent of Vitreous Kernels on the Basis of a Logarithmic Curve
(log Y = 0.9469 + 0.00258 X)

 Per cent     Estimated per cent protein     Actual      Percentage errors
 vitreous                                    per cent    in estimating
 kernels      Estimated     Antilog of       protein     protein proportion
              logarithm     estimate
    X         (log Y)'         Y'               Y        100(Y/Y' - 1.00)

    6          0.9624           9.2            10.3          +12.0
   75          1.1401          13.8            12.2          -11.6
   87          1.1710          14.8            14.5          - 2.0
   55          1.0888          12.3            11.1          - 9.8
   34          1.0343          10.8            10.9          + 0.9
   98          1.1993          15.8            18.1          +14.6
   91          1.1813          15.2            14.0          - 7.9
   45          1.0628          11.6            10.8          - 6.9
   51          1.0783          12.0            11.4          - 5.0
   17          0.9907           9.8            11.0          +12.2
   36          1.0396          11.0            10.2          - 7.3
   97          1.1968          15.7            17.0          + 8.3
   74          1.1375          13.7            13.8          + 0.7
   24          1.0087          10.2            10.1          - 1.0
   85          1.1669          14.7            14.4          - 2.0
   96          1.1942          15.6            15.8          + 1.3
   92          1.1839          15.3            15.6          + 2.0
   94          1.1890          15.5            15.0          - 3.2
   84          1.1633          14.6            13.3          - 8.9
   99          1.2019          15.9            19.0          +19.5


It should be noted in this table that errors made in estimating the
proportion of protein are stated as relative errors rather than absolute
errors. That is done because the thing that is really estimated is the
logarithm of the percentages of protein, or $\bar{Y}$, and the errors are
really the differences between the actual logarithms and the estimated
logarithms. If z is used to stand for the error, in this case z is really
in terms of logarithms, that is:

$$z = \log Y - \text{estimated } \log Y, \quad \text{or} \quad z = \bar{Y} - \bar{Y}'$$

or in terms of natural numbers:

$$\text{antilog } z = \frac{\text{antilog } \bar{Y}}{\text{antilog } \bar{Y}'} = \frac{\text{actual } Y}{\text{estimated } Y}$$

Subtracting the constant 1.00 and multiplying by 100 changes this
relative figure to the percentage which the observed value is above
or below the estimate.¹²
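The conversion from a logarithmic residual to a percentage error can be sketched briefly in Python. The two function names are illustrative only; the figures used are those of the first two observations in Table 22.

```python
import math


def percentage_error(actual, estimated):
    """Per cent by which the actual value lies above (+) or below (-) the estimate."""
    return 100.0 * (actual / estimated - 1.0)


def percentage_error_from_logs(log_actual, log_estimated):
    """Same quantity, starting from the residual z in terms of logarithms."""
    z = log_actual - log_estimated
    return 100.0 * (10 ** z - 1.0)


print(round(percentage_error(10.3, 9.2), 1))   # first observation
print(round(percentage_error(12.2, 13.8), 1))  # second observation
```

Both routes give the same figure, since the antilogarithm of a difference of logarithms is the ratio of the two natural numbers.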

Where log Y is taken as the dependent variable, as has been done
here, fitting the equation by the methods just shown involves making
the square of the logarithmic residuals around the line as small as
possible. That means that instead of minimizing the sum of the
absolute errors, squared, as heretofore, we now minimize the sum of the
percentage errors, squared. In some cases it may be desired to use
the logarithmic curve, yet to continue to minimize the absolute errors.
Relatively simple methods are available to accomplish that result.¹³


¹² The reason for making this distinction will be seen later on, when the
question of measuring the accuracy of the estimate is taken up.

¹³ To fit the equation

$$\log Y = a + b(\log X)$$

under the conditions that the sum of the squares of the absolute departures of the
estimated values, Y', from the actual values, Y, will be as small as possible,
determine the values of a and b by solving the equations

$$\Sigma(Y^2)\,a + \Sigma(Y^2\bar{X})\,b = \Sigma(Y^2\bar{Y})$$

$$\Sigma(Y^2\bar{X})\,a + \Sigma(Y^2\bar{X}^2)\,b = \Sigma(Y^2\bar{X}\bar{Y})$$

where $\bar{Y} = \log Y$, and $\bar{X} = \log X$, as above.

To compute the several sums involved in these equations, the following form
may be used:

   X      Y       Y²     log X   log Y   Y² log X   Y² log X log Y   Y² log Y   Y² (log X)²

   6    10.3    106.09   0.778   1.013     82.54        83.61         107.47
  75    12.2    148.84   1.875   1.086    279.08       303.08         161.64

 Sums     -        Σ       -       -         Σ            Σ              Σ           Σ

The two simultaneous equations can be solved conveniently by the same
procedure described in Appendix 1, page 464.

For the derivation of these equations, see W. Edwards Deming, Some Notes on
Least Squares, pp. 136-141. U. S. Department of Agriculture Graduate School,
Washington, 1938.
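The pair of simultaneous equations in this note amounts to an ordinary two-constant fit in which each observation is weighted by Y². A minimal sketch in Python follows, under that reading; the function name `fit_loglog_abs` is hypothetical, and the two equations are solved directly by determinants rather than by the procedure of Appendix 1.

```python
import math


def fit_loglog_abs(pairs):
    """Solve the two equations of the note for a and b in log Y = a + b log X.

    Every sum carries the weight Y**2, as in the form shown in the note.
    """
    s_w = s_wx = s_wxx = s_wy = s_wxy = 0.0
    for x, y in pairs:
        w = y * y
        lx, ly = math.log10(x), math.log10(y)
        s_w += w
        s_wx += w * lx
        s_wxx += w * lx * lx
        s_wy += w * ly
        s_wxy += w * lx * ly
    # Solution of the 2 x 2 system by determinants.
    det = s_w * s_wxx - s_wx ** 2
    a = (s_wy * s_wxx - s_wx * s_wxy) / det
    b = (s_w * s_wxy - s_wx * s_wy) / det
    return a, b


# Illustrative data lying exactly on Y = 2 * X**0.5 (not from the wheat problem).
exact = [(x, 2.0 * x ** 0.5) for x in (1, 4, 9, 16, 25)]
a, b = fit_loglog_abs(exact)
print(round(a, 4), round(b, 4))
```

With data that follow the curve exactly, the weighting makes no difference and the constants are recovered as log 2 and 0.5; with scattered data the weights give the larger values of Y more influence than the unweighted logarithmic fit does.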


















In Figures 16 and 17 the actual proportions of protein, shown as
dots, are compared with the estimated values as worked out by the
logarithmic relation. In the first of these figures the actual and
estimated values are both stated in terms of the logarithms. It is quite
apparent here that this equation assumes a straight-line relation
between the proportion of vitreous kernels and the logarithms of the
proportion of protein; since they were computed by a straight-line
equation (log Y = a + bX) the estimated values all lie along the
continuous straight line indicated. The next figure, however, compares
the actual proportion of protein with the estimated, both stated in
natural terms. Here the continuous curve which the logarithms produce
in the estimated actual values is clearly shown. The relation between
the proportion of vitreous kernels and the percentage of protein, as
shown by this curve, does not agree with the actual relation as shown
by the original observations even as closely as did the previous
curves computed by means of parabolic equations.

Fig. 16. Dot chart showing observations and fitted line for equation
log Y = a + bX, in logarithms of Y.

Fig. 17. Dot chart showing observations and fitted line for equation
$Y = 10^{a+bX}$, in natural values of Y.

Before discussing other ways of expressing the curvilinear relation it
might be well to discuss the procedure to determine the constants a
and b if either of the other two forms of simple logarithmic equations
were used.

If the equation Y = a + b log X is employed, the form
$Y = a + b\bar{X}$ is used, where $\bar{X}$ stands for log X.

The values which must be computed are

$$M_y,\quad M_{\bar{x}},\quad \Sigma \bar{X}^2,\quad \Sigma Y\bar{X},$$

and the constants are determined from the equations

$$b = \frac{\Sigma Y\bar{X} - nM_yM_{\bar{x}}}{\Sigma \bar{X}^2 - nM_{\bar{x}}^2}$$

$$a = M_y - bM_{\bar{x}}$$





Since the equation is in terms of Y itself, the estimated values,
computed from the logarithms of X, will be directly in values of Y, and
will not have to be converted to the antilogarithms.

If the equation log Y = a + b log X is to be fitted, the form
$\bar{Y} = a + b\bar{X}$ is used, where $\bar{Y}$ stands for log Y and
$\bar{X}$ for log X.

The values which will have to be computed are:

$$M_{\bar{y}},\quad M_{\bar{x}},\quad \Sigma \bar{Y}\bar{X},\quad \Sigma \bar{X}^2,$$

and the constants are determined from the equations

$$b = \frac{\Sigma \bar{Y}\bar{X} - nM_{\bar{y}}M_{\bar{x}}}{\Sigma \bar{X}^2 - nM_{\bar{x}}^2}$$

$$a = M_{\bar{y}} - bM_{\bar{x}}$$

In this case the equation is in terms of $\bar{Y}$, the logarithms of Y,
and the estimated values will therefore have to be converted from
logarithms into natural numbers to show just what the relationship is,
just as was done in the case that was worked out in detail earlier.

It is evident that no matter which one of the three logarithmic 
curves is employed, the arithmetic is exactly the same as in deter- 
mining the simple straight line, with the exception of computing the 
logarithms and of substituting the appropriate logarithms where the 
actual values would otherwise be employed. 
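That the three logarithmic forms differ only in which values are replaced by their logarithms can be shown with one straight-line routine. The sketch below is an illustration in Python; `fit_line` and `fit_curve` are hypothetical names, and the test data are made up to lie exactly on each type of curve.

```python
import math


def fit_line(us, vs):
    """Constants a and b of the straight line v = a + b*u, by the usual sums."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    sum_uv = sum(u * v for u, v in zip(us, vs))
    sum_uu = sum(u * u for u in us)
    b = (sum_uv - n * mean_u * mean_v) / (sum_uu - n * mean_u ** 2)
    a = mean_v - b * mean_u
    return a, b


def fit_curve(xs, ys, log_x=False, log_y=False):
    """Substitute logarithms where called for, then fit the straight line."""
    us = [math.log10(x) for x in xs] if log_x else list(xs)
    vs = [math.log10(y) for y in ys] if log_y else list(ys)
    return fit_line(us, vs)


xs = [1, 10, 100, 1000]
# Y = a + b log X, with a = 3 and b = 2
print(fit_curve(xs, [3 + 2 * math.log10(x) for x in xs], log_x=True))
# log Y = a + b log X, with a = log 5 and b = 1.5
print(fit_curve(xs, [5 * x ** 1.5 for x in xs], log_x=True, log_y=True))
```

When `log_y` is set, the estimates come out as logarithms and must still be converted to natural numbers; when only `log_x` is set, they are already in values of Y.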

In cases where other modifications of the straight-line equation, 
such as type (e), are to be used, the process is to transform the equa- 
tion to a linear form, then compute the constants just as before. 

Thus the type

$$Y = \frac{1}{a + bX}$$

can be converted to the form

$$\frac{1}{Y} = a + bX$$

or, letting $\frac{1}{Y} = Q$,

$$Q = a + bX$$

The computation can then be carried out in the usual way, and
after the estimated values of Q, Q', are worked out, they are converted
back into Y' values by the equation

$$Y' = \frac{1}{Q'}$$
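The reciprocal transformation can be sketched in the same way. This Python fragment is illustrative only: the function names are hypothetical and the data are invented to lie exactly on a curve of this type.

```python
def fit_line(us, vs):
    """Constants a and b of the straight line v = a + b*u, by the usual sums."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    sum_uv = sum(u * v for u, v in zip(us, vs))
    sum_uu = sum(u * u for u in us)
    b = (sum_uv - n * mean_u * mean_v) / (sum_uu - n * mean_u ** 2)
    return mean_v - b * mean_u, b


def fit_reciprocal(xs, ys):
    """Fit Q = a + bX, where Q = 1/Y."""
    qs = [1.0 / y for y in ys]
    return fit_line(xs, qs)


def estimate_y(x, a, b):
    """Work out Q' and convert it back to Y' = 1/Q'."""
    return 1.0 / (a + b * x)


xs = [10, 20, 30, 40]
ys = [1.0 / (0.05 + 0.002 * x) for x in xs]  # made-up data on the curve
a, b = fit_reciprocal(xs, ys)
print(round(a, 4), round(b, 4), round(estimate_y(25, a, b), 2))
```

As with the logarithmic fit, the least-squares condition here applies to the transformed values Q, not to Y itself.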





Limitations of equations in describing relationships. Up to this 
point an expression of the relation between the proportion of vitreous 
kernels and the proportion of protein in each sample has been worked 
out on the basis of a number of different mathematical formulas. 
Each different equation has given a different curve. Some, such as 
the cubic parabola or the logarithmic curve, have given curves com- 
ing somewhere near to the relationship shown by the actual observa- 
tions themselves; others, such as the simple straight line, have entirely 
failed to describe the relation. Yet the exact slope or shape of each 
curve was determined from the same set of observations; the constants
of each curve were determined by “fitting” the same data. The
diversity in the shape of the different curves is strikingly shown in
Figure 18, where the several different curves are all drawn on one
scale, and the original observations are shown as well. It is quite
apparent that the differences in the shapes of the several curves are
due solely to the particular form of equation used in computing them.
There are certain types of relations which can be accurately
represented by each of these equations. When such an equation is
“fitted” to data where that type of relation is really present, it can
give a curve which accurately represents the true relation shown by the
data. When, however, as in the present case, an attempt is made to
represent a relation by an equation which does not truly express the
nature of the relation, the resulting curve gives only a distorted
representation of the true relation; it shows the relation only insofar
as it is possible to do so within the limits of the particular equation
used.

Fig. 18. Original observations, and several different types of fitted
curves. (Vertical axis: protein content.)

So far there has been no attempt to show what there is in the 
“nature” of relations which may make them of the type to be
represented accurately by one type of equation or by another. Instead,
the purely empirical test of the way each one fits has been relied 
upon. If, as judged by the eye, the relation shown by the fitted 
curve looked like the relation shown by the original observations, we 
have said it gave a satisfactory fit; if it has not looked like it, we 
have said it did not give a satisfactory fit. And in this particular 
case, none of the computed curves has been really fully satisfactory — 
we can readily see that there might be some other smooth continuous 
curve which would come much closer to the actual observations than 
does any of the curves so far computed. 

Of course we might continue the process, using more and more 
complex equations, until finally we found one which did satisfactorily 
describe the relation. Or we might find that no ordinary mathematical 
expression would describe the relation. It might be that the under- 
lying curve was so complex that it could not be represented in 
elementary algebraic terms. But even if we could describe the rela- 
tion satisfactorily by some type of equation, the only advantage 
would be that then we would have some way of estimating values 
of the dependent variable (percentages of protein) from the inde- 
pendent variable (proportion of vitreous kernels) such as would 
agree reasonably well with the values actually observed. So long 
as the equation had been derived merely by the “cut-and-try”
method described, it would have no meaning beyond serving as a 
simple device for estimating values of the one variable from known 
values of the other and would throw no particular light upon the 
real or inherent nature of the relation. For if we could find, by 
enough trying, one equation which would represent the relation satis- 
factorily, it might be that we could also find another. As a matter 
of fact, sometimes it is found that two different types of equations
may each give exactly the identical curve when figured out.¹⁴ Which
one expresses the “true” nature of the relation? Merely because a
given equation can reproduce a certain relation is no proof that it
really “expresses” the nature of the relation. Something more must

¹⁴ An example of this type may be seen in the bulletin, What makes the price
of oats, by Hugh Killough, U. S. Department of Agriculture Bulletin 1351, page 8.
Here equations of two different types were found to yield almost identical curves,
within the range covered by the observations studied.





be known than merely that it can express the relation. What that 
something is will be taken up in a later section. 

If, however, it is not desired to determine what the “real nature”
of the relationship is, but it is merely desired to express it suffi- 
ciently well so that values of one variable (such as protein content) can 
be estimated from known values of another (such as the proportion 
of vitreous kernels), it does not make any difference what type of 
equation is used, so long as it represents the observed relationship 
adequately. As a matter of fact, it is not really necessary to have 
an equation at all. If we have only a graph of the curve, or a table 
of values for one variable corresponding to values of another, from 
which we can construct a graph, that is all that is really necessary. 
For if we have a graph of the curve we can very readily estimate the 
value for one variable from corresponding known values for another 
by simply reading it from the curve. Thus in Figure 13 the curve 
for the equation 

$$Y = a + bX + cX^2$$

is shown. If we wish to estimate the percentage of protein for a sample 
having, say 50 per cent of vitreous kernels, we need only to run up 
the line for X = 50 and note the value of Y corresponding to that 
point on the curve. In this case it is apparently about 10.8 per cent. 
Similarly, the estimates of the percentage of protein corresponding to 
any other percentage of vitreous kernels within the range covered by 
the curve may be read off directly from the curve. Further, by 
enlarging the chart and making the scale sufficiently detailed, we may 
read off the estimated values to any degree of accuracy that is desired 
— much more accurately, as a matter of fact, than our ability to de- 
termine the real relation usually justifies, as will be evident later on. 

In many cases — perhaps in the great majority of cases — simply 
the working expression of the relation may be all that is either needed
or desirable. The “true relation” between the variables may be so 
involved that a very complex mathematical expression would be re- 
quired to represent it properly. Even simple types of physical rela- 
tions may require rather complex curves to represent them. In 
many cases, too, the knowledge of the causes of the relation may be 
so undeveloped that there is no real basis for expressing the relation- 
ship mathematically. The relation between vitreous kernels and 
percentage of protein would be an example of this type — very complex 
details of chemical content and physical and biological structure are 
probably responsible, so complex as to be quite beyond satisfactory
reduction to mathematical expression. Yet the original observations 
undeniably indicate that there is some sort of definite relation. For 
many practical purposes it may be entirely satisfactory merely to 
know what the relationship is, without bothering at all with what 
it really means. Even in scientific study that may frequently be 
satisfactory as a first step, since in many cases it is essential to know 
what are the facts before trying to work out the reasons why they 
are as they are. 

When the expression of the relation is not to be used except as 
an empirical basis for estimating values of the dependent variable 
from the independent, or for showing just what the relationship is, 
the elaborate technique of determining the constants of a mathe- 
matical equation and working out the estimated values by the use of 
that equation becomes largely unnecessary. In many cases a curve 
can be determined with only a small fraction of the effort required in 
“fitting” a mathematical equation, yet it fits the data quite as well
as any mathematical curve. In such cases the curve may afford quite 
as satisfactory a description of the relation and a basis for estimating 
one variable from the other as if elaborate computations had been 
made. This method is known as freehand smoothing. 

Expressing a curvilinear relation by a freehand curve. The 
process of determining a freehand curve may be very simply illus- 
trated. In fact, it has already been suggested in much of the previous 
discussion. The very simplest way to do it would be to plot the 
original observations on coordinate paper, just as has been shown 
so many times before, and then draw a continuous smooth curve 
through them by eye in such a way as to pass approximately through 
the center of the observations all along its course. Where the nature
of the relation is indicated quite as closely by the original observations 
as it is in the wheat problem which we have been discussing, this 
might yield quite a satisfactory expression of the relation. In other 
cases, however, the observations might be more widely scattered, 
and the underlying relation might be more difficult to determine, so 
that different persons, drawing in the curves freehand, might draw 
in rather different curves. Some method is therefore needed to give 
a greater degree of precision to the result, and to insure that the 
same data would yield substantially the same result even in the 
hands of different investigators. 

This stability of result can be secured by a relatively minor
extension of the methods already discussed in the first illustration of
a two-variable relationship, the automobile-stopping problem. There
it was found that by classifying the observations in appropriate
groups, the general nature of the relation could be expressed by
an irregular line connecting the several group averages. All that is
needed is some method of deriving a continuous smooth curve from
that irregular line. Smoothing out that irregular line, freehand, is
a very evident and simple method. At the same time, starting with
the irregular line of group averages gives a certain stability to the
process and insures that different persons would draw in the curve
with about the same position and shape.

TABLE 23

Computation of Averages to Use in Fitting Freehand Curve, for
Wheat-Protein Problem

             Vitreous kernels    Vitreous kernels    Vitreous kernels    Vitreous kernels
             below 25 per cent   25 to 49 per cent   50 to 74 per cent   75 to 100 per cent

             Per cent  Per cent  Per cent  Per cent  Per cent  Per cent  Per cent  Per cent
             vitreous  protein   vitreous  protein   vitreous  protein   vitreous  protein
             kernels             kernels             kernels             kernels

                 6       10.3       34       10.9       55       11.1       75       12.2
                17       11.0       45       10.8       51       11.4       87       14.5
                24       10.1       36       10.2       74       13.8       98       18.1
                                                                            91       14.0
                                                                            97       17.0
                                                                            85       14.4
                                                                            96       15.8
                                                                            92       15.6
                                                                            94       15.0
                                                                            84       13.3
                                                                            99       19.0

Totals          47       31.4      115       31.9      180       36.3      998      168.9
No. cases            3                   3                   3                  11
Averages     15.67      10.47    38.33      10.63    60.00      12.1     90.73      15.35

Applying the process to the wheat problem, the first step is to 
classify the data into appropriate groups according to the values 
of the independent variable, the proportion of vitreous kernels, and 
to determine the average percentage of vitreous kernels and of protein 
content for the observations falling into each group. The discussion 
of the automobile problem has shown that, for the differences in
averages to be significant, it is necessary for the groups to be large 
enough so that the averages would not vary erratically from group to 
group. In some cases a little experimenting might be necessary to 
determine what this size would be. In the present case, inspection of 
the dot chart showing the original observations (Figure 12, page 83)
indicates that a class interval of 25 per cent of vitreous kernels will 
give groups large enough to make the averages of protein content 
fairly stable from group to group. 

The form of computation most convenient to obtain the group 
averages, using groups of the size suggested, is shown in Table 23. 
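The grouping and averaging can be sketched as follows. This is an illustrative Python fragment, not the author's worksheet: the names `WHEAT`, `group_averages`, and `CLASSES_25` are hypothetical, the observations are those listed in Table 21, and each class is treated as running from its lower limit up to, but not including, the next lower limit.

```python
# (X, Y) pairs for the wheat problem as listed in Table 21.
WHEAT = [
    (6, 10.3), (75, 12.2), (87, 14.5), (55, 11.1), (34, 10.9),
    (98, 18.1), (91, 14.0), (45, 10.8), (51, 11.4), (17, 11.0),
    (36, 10.2), (97, 17.0), (74, 13.8), (24, 10.1), (85, 14.4),
    (96, 15.8), (92, 15.6), (94, 15.0), (84, 13.3), (99, 19.0),
]

# Class limits for an interval of 25 per cent of vitreous kernels;
# the last class is closed at 100 by using 101 as its upper bound.
CLASSES_25 = [(0, 25), (25, 50), (50, 75), (75, 101)]


def group_averages(pairs, classes):
    """Return (number of cases, average X, average Y) for each non-empty class."""
    result = []
    for low, high in classes:
        members = [(x, y) for x, y in pairs if low <= x < high]
        if not members:
            continue  # an empty class has no average to plot
        n = len(members)
        mean_x = sum(x for x, _ in members) / n
        mean_y = sum(y for _, y in members) / n
        result.append((n, mean_x, mean_y))
    return result


for n, mean_x, mean_y in group_averages(WHEAT, CLASSES_25):
    print(n, round(mean_x, 2), round(mean_y, 2))
```

Passing a different list of class limits regroups the same observations, which is convenient when a little experimenting with the size of the groups is needed.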

The averages for the several groups are shown in Figure 19,
indicated by hollow circles, whereas original observations are again
shown by solid dots. A smooth continuous dashed curve has been drawn
through the series of group averages, ignoring the individual
observations and following only the general trend shown by the averages.
This smooth curve comes quite near to representing the relation shown
by the individual observations through most of its extent; but beyond
95 per cent of vitreous kernels it fails to follow the individual
observations. Through that portion of the range the protein content
rises much faster than is indicated by the average for the whole range
from 75 through 100 per cent vitreous kernels.

Fig. 19. Original observations and averages of protein content, and
freehand curve. (Vertical axis: protein content.)

Because over half of all the observations fall in this upper portion
of the range, it would seem reasonable to classify them into smaller
groups so as to give a better basis for determining this portion of
the curve. Let us try splitting the observations above 50 into four
groups, each with about the same number of observations, say 50 to
69, 70 to 84, 85 to 94, and 95 to 100. The computation of the new
averages is shown in Table 24.


TABLE 24

Computation of Sub-averages for Last Groups in Wheat Problem, for
Fitting Freehand Curve

             Vitreous kernels    Vitreous kernels    Vitreous kernels    Vitreous kernels
             50 to 69 per cent   70 to 84 per cent   85 to 94 per cent   95 to 100 per cent

             Per cent  Per cent  Per cent  Per cent  Per cent  Per cent  Per cent  Per cent
             vitreous  protein   vitreous  protein   vitreous  protein   vitreous  protein
             kernels             kernels             kernels             kernels

                55       11.1       75       12.2       87       14.5       98       18.1
                51       11.4       74       13.8       91       14.0       97       17.0
                                    84       13.3       85       14.4       96       15.8
                                                        92       15.6       99       19.0
                                                        94       15.0

Totals         106       22.5      233       39.3      449       73.5      390       69.9
No. cases            2                   3                   5                   4
Averages        53      11.25    77.67       13.1     89.8       14.7     97.5      17.48


These new averages, together with the previous ones for the lower
groups, are also plotted in Figure 19, and the number of cases that
each represents is indicated next to it, to aid in judging what weight
to assign to that average. Finally, a smooth continuous curve has
been drawn in, to pass as near as possible to the different averages
without making illogical twists or turns. As is evident in the figure,
it has been possible to draw the line with no point of inflection in it,
yet so that it passes quite near to all the group averages and
approximately through the middle of the individual observations.
Further, the general course of the line is sufficiently well defined by
the several group averages so that if it were redrawn, either by the
same person or another person, it could have only minor differences
from the line actually shown. Making the chart over two or three times,
and drawing a separate curve on each trial, then averaging the two or
three curves together, is one method of reducing the variation due to
individual judgment in drawing the curve.





Cautions in freehand fitting. In drawing in the freehand curve 
no attempt has been made to have the curve follow all the twists and 
turns of the irregular line of averages. As was shown previously 
with the automobile illustration, these irregular differences from group 
to group may very readily be due to chance fluctuations in sampling 
where the groups are small. Not unless the groups included a very 
much larger number of cases than these do here would one be justified 
in bending the curve because of the position of a single group average, 
and not even then unless there was some logical basis for a curve of that 
shape. In doubtful cases breaking up a particular group into smaller 
groups, as was just done in the wheat example, or reclassifying the 
observations into somewhat different groups, will help to determine 
whether or not the data positively indicate that an extra inflection 
is needed. It is also necessary to see if some single observation is 
responsible for the abnormality; if it is, it is better to disregard it 
and draw the curve without the extra twist. 

In drawing in a freehand curve, it is desirable to place certain 
logical limitations on the shape of the curve rather than to have it be 
purely an empirical representation of the data. To do this, it is 
necessary to decide before the curve is drawn what those limitations 
should be. The limitations should be based upon a logical analysis 
of the relation under examination, in the light of all the information 
available to the investigator. In this case, for example, a considera- 
tion of the biological structure of the kernels, of the portions which 
run high in protein content, and of the appearance and size of those 
portions might lead one to the following conclusions: 

(a) An increase in the proportion of vitreous kernels might be 
associated with no change in the proportion of protein, or with an 
increase in the proportion, but never with a decrease in the proportion. 

(b) The relation between vitreous kernels and protein should be 
a progressive one, consistently changing throughout the range of 
variation, rather than fluctuating back and forth. 

(c) The maximum proportion of protein would be found with the 
largest proportion of vitreous kernels. 

These three logical expectations might then be expressed in the 
following limitations to be placed on the shape of the curve to be 
drawn:

(1) The curve should have no negative slope throughout its length. 

(2) The curve should have no points of inflection, but should 
change shape continuously and progressively. 

(3) The maximum should be reached at the end of the curve.





These three logical limitations are all fulfilled by both the curves 
shown in Figure 19, yet they would exclude other types of curves 
which might be drawn. For example, they would rule out a curve 
with a hump or twist in it, or one which sloped down and then up.¹⁵
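Once readings have been taken from a curve, the three limitations can be checked mechanically. The Python sketch below is an illustration only: the function names are hypothetical, the readings are those tabulated later in Table 26 (with the reading at 90 per cent taken as 15.0, an assumption made here because the curve reads 15.2 at 91 per cent in Table 25), and a check on a handful of readings can only approximate a check on the continuous curve.

```python
# Readings (X, estimated Y) from the freehand curve, as in Table 26.
# The value at X = 90 is assumed to be 15.0 (see the note above).
CURVE = [
    (10, 10.4), (20, 10.5), (30, 10.7), (40, 10.9), (50, 11.2), (60, 11.7),
    (70, 12.4), (80, 13.6), (90, 15.0), (95, 16.2), (99, 18.0),
]

TOL = 1e-9  # allowance for rounding in floating-point subtraction


def slopes(points):
    """Slope of each straight segment joining successive readings."""
    return [(y2 - y1) / (x2 - x1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]


def no_negative_slope(points):
    """Limitation (1): the curve never slopes downward."""
    return all(s >= -TOL for s in slopes(points))


def no_inflection(points):
    """Limitation (2): the slope changes in one direction only."""
    s = slopes(points)
    changes = [later - earlier for earlier, later in zip(s, s[1:])]
    return all(c >= -TOL for c in changes) or all(c <= TOL for c in changes)


def maximum_at_end(points):
    """Limitation (3): the highest reading is the last one."""
    return points[-1][1] == max(y for _, y in points)


print(no_negative_slope(CURVE), no_inflection(CURVE), maximum_at_end(CURVE))
```

A curve with a hump in it, such as the readings (0, 1), (1, 3), (2, 2), (3, 4), fails the first two checks.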

In some cases, examination of the data by the method of successive 
group averages, even after all the tests suggested above, will show the 
presence of a relation which cannot be expressed within the logical 
limitations imposed on the shape of the curve. In that case, the rea- 
soning underlying the logical analysis should be reexamined, to see 
if some step requires restatement and if the limitations themselves 
should be changed. (For a further discussion of this interaction of 
induction and deduction, see pages 443 to 452 of Chapter 24.) For 
a curve to have real meaning, it must be consistent with a careful 
logical analysis, no matter whether the curve is obtained mathe- 
matically or freehand, or whether the logical limitations are expressed 
in a mathematical equation or in a set of limitations placed on the
shape of the curve drawn by freehand fitting.¹⁶

Interpreting the fitted curve. It is evident that the freehand curve 
comes closer to agreeing with all the original observations than did 
any of the mathematically determined curves. So far as can be 
judged by eye alone, it “fits” the relation actually observed quite
satisfactorily. So far as giving a definite statement of the relation, 
and serving as a basis for estimating values of one variable from known 
values of the other, this curve, obtained by the very simple process 
shown, is more satisfactory than any of the curves obtained by the 
mathematical computations. 

The use of the freehand curve in estimating values of the dependent 
variable, percentage of protein, from known values of the independent 
variable, proportion of vitreous kernels, may be readily illustrated. 
Taking the first observation, with 6 per cent of vitreous kernels, and 
reading off the corresponding proportion of protein from the curve 

¹⁵ This use of logical analysis in stating the limitations on a freehand curve
may be compared with the use of logic in deciding on the type of mathematical
equation to employ. Note the subsequent section in this chapter on “The logical
significance of mathematical functions.”

¹⁶ For a more detailed discussion of the pros and cons of freehand versus
mathematical fitting, see W. Malenbaum and J. D. Black, The use of the short-cut
graphic method of multiple correlation, Quarterly Journal of Economics, Vol. LII,
November, 1937, and The use of the short-cut graphic method of multiple
correlation: comment, by Louis Bean, and Further comment, by Mordecai Ezekiel, and
Rejoinder and concluding remarks, by Malenbaum and Black, Quarterly Journal
of Economics, February, 1940.





in Figure 19, we get 10.4 per cent as the estimated protein content. 
Similarly for the second observation, 75 per cent vitreous kernels, 
the curve indicates 12.9 per cent as the proportion of protein. Reading 
off the estimated protein for each of the 20 observations we get the 
estimates shown in Table 25. 

TABLE 25

Actual Per Cent of Protein and Proportion Estimated on Basis of
Freehand Curve

 Proportion of      Actual proportion    Proportion of         Difference between
 vitreous kernels   of protein           protein estimated     actual and estimate
                                         from vitreous
                                         kernels
       X                  Y                 Y' = f(X)               Y - Y'

       6                10.3                  10.4                   -0.1
      75                12.2                  12.9                   -0.7
      87                14.5                  14.5                    0
      55                11.1                  11.4                   -0.3
      34                10.9                  10.7                    0.2
      98                18.1                  17.4                    0.7
      91                14.0                  15.2                   -1.2
      45                10.8                  11.1                   -0.3
      51                11.4                  11.3                    0.1
      17                11.0                  10.5                    0.5
      36                10.2                  10.8                   -0.6
      97                17.0                  17.0                    0
      74                13.8                  12.8                    1.0
      24                10.1                  10.6                   -0.5
      85                14.4                  14.2                    0.2
      96                15.8                  16.7                   -0.9
      92                15.6                  15.5                    0.1
      94                15.0                  15.9                   -0.9
      84                13.3                  14.0                   -0.7
      99                19.0                  18.0                    1.0

Even though in using the freehand curve we do not have an
equation stating the relation between X and Y, we still have a
mathematical expression of the relation between them. For we can write

$$Y' = f(X)$$

which simply means that the estimates, or Y' values, are a function
of X; that is, for every X value there is some corresponding Y'


value. Of course, we can find what this corresponding value is only 
by reading it off the curve; yet that is enough. We have a graphic 
statement of the functional relation; if we had a definite formula 
to represent the curve, we would have an analytical statement of the 
relation as well. 

Although we do not have a definite equation to represent the free- 
hand curve, it is still possible to state the relation shown by the 
curve other than in graphic form. This can be done by constructing 
a table showing, for whatever values of the independent variable may 
be selected, the corresponding estimated values of the dependent vari- 
able. Such a tabular statement of the relation may be more readily 
comprehended by readers not accustomed to graphic presentation. 
Further, it provides a basis for reconstructing the curve on any scale 
desired for the purpose of making further estimates. Table 26 illus- 
trates this method of stating the relation. 

TABLE 26

Per Cent of Protein Corresponding to Various Proportions of Vitreous
Kernels in Samples of Wheat, as Indicated by 20 Observations

Proportion of      Corresponding    Proportion of      Corresponding
vitreous kernels   proportion of    vitreous kernels   proportion of
                   protein                             protein
Per cent           Per cent         Per cent           Per cent

10                 10.4             70                 12.4
20                 10.5             80                 13.6
30                 10.7             90                 16.0
40                 10.9             95                 16.2
50                 11.2             99                 18.0
60                 11.7


In the range where the curve is rising most steeply the readings
are taken more closely together, to provide for reproducing that
portion of the curve more accurately. In addition, no readings are taken
beyond the range covered by the original observations, nor are any
shown for the extreme ends where the observations are few. This
raises the whole question of how curves like this can serve as a basis
for estimating when measurements are made of the independent
variable, such as proportion of vitreous kernels, in cases other than those
used in determining the relation. This problem will be taken up at
the end of this chapter. But first the question of whether to use
freehand or analytical curves will be discussed.
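The tabular statement of the curve can be turned into a small estimating routine. The following is only a sketch: the readings are those of Table 26 as printed, while the function name `estimate_protein` and the use of straight-line interpolation between adjacent readings are assumptions made for illustration (the text itself reads values directly off the freehand curve). In keeping with the paragraph above, the routine declines to estimate outside the tabulated range.

```python
# Readings from Table 26 as printed: (per cent vitreous kernels, per cent protein).
TABLE_26 = [
    (10, 10.4), (20, 10.5), (30, 10.7), (40, 10.9), (50, 11.2), (60, 11.7),
    (70, 12.4), (80, 13.6), (90, 16.0), (95, 16.2), (99, 18.0),
]


def estimate_protein(x, table=TABLE_26):
    """Estimate Y' for a given X from the tabulated readings.

    Straight-line interpolation between neighbouring readings is an
    assumption of this sketch, not the book's procedure.
    """
    lowest, highest = table[0][0], table[-1][0]
    if not lowest <= x <= highest:
        # No estimate is offered beyond the range of the tabulated readings.
        raise ValueError(f"X = {x} is outside the tabulated range {lowest}-{highest}")
    for (x0, y0), (x1, y1) in zip(table, table[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


if __name__ == "__main__":
    print(round(estimate_protein(75), 1))
```

Because the chord between two readings lies slightly off a curved line, an interpolated value can differ a little from a direct reading of the curve (the text reads 12.9 at 75 per cent, whereas the chord between the 70 and 80 readings gives about 13.0). Taking readings closer together where the curve is steep, as the table does, keeps that difference small.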

The logical significance of mathematical functions. There has 
been frequent reference previously to the question whether an equation
did or did not express “the real nature” of a relationship, with
little explicit attempt to explain exactly what that meant. To know
when we are justified in using the simple freehand curve, and when 
we should go to the additional work of determining an equation for 
the curve, we must understand the logical bases for different types 
of equations, so that we can judge whether or not any particular type 
of curve can logically be expected to express the relation in any given 
set of observations. 

The linear equation. Many relations are so simple that ordinarily 
we would not think of expressing them mathematically. Thus, if a 
train is traveling 45 miles an hour, the distance traveled is equal to 
the time multiplied by the speed. Using t for the time in hours, d 
for distance, and s for speed, the relation is obviously 

$$d = st$$

This is a simple straight-line relation. Now, if, in addition, the 
train were a miles away from a given station at the beginning, after 
t hours of additional travel away from the station it would be D miles 
away, where 

$$D = a + d = a + st$$

This is now expressed in the usual form for the straight-line
equation, Y = a + bX. This equation is therefore the one to be
used when it can logically be expected that each unit change in X
causes the same change in Y, regardless of the size of X. Thus
in computing the distance the train has traveled we are assuming
that it continued to travel at a definite rate, say 45 miles an hour,
the whole way, and traveled the 200th mile just as fast as the first
mile. Now if we were dealing with something where the change in
Y was not the same for different values of X, the equation would no
longer be satisfactory. For example, an airplane on a long-distance
flight has to carry a heavy load of gasoline at the start and hence
cannot attain full speed; the farther it goes the lighter its load
becomes and the higher speed it can make. In such a case the
straight-line formula would not be applicable, since the speed of the plane
would increase with the distance it had gone. If the straight-line
formula were used, it would indicate that it would take just as long
to travel the first hundred miles as the last hundred, whereas actually
it would take longer than that to travel the first hundred and less
than that to travel the final hundred. Only an equation which
included some value that properly took into account the change in speed
with the change in distance could satisfactorily represent this relation.

The quadratic equation. Another case in which the rate at which 
Y increases changes as the value of X increases is that of a weight 
falling to the ground. Since the attraction of the earth is for prac- 
tical purposes a constant, it exercises a constant pull on a falling 
body. Thus, the farther a body falls, the faster it travels. It is just 
as if, in throwing a ball, a boy did not let go the ball for it to travel 
by its momentum but was able to keep shoving against it, adding 
more and more speed to the momentum it already had. Physicists 
express this relation by saying that the velocity with which an object 
falls is accelerated at a constant rate. The equation, therefore, is:

$$V = gt$$

where g is a constant measuring the force of gravity, V is velocity in
feet per second, and t is time in seconds.

With regard to the distance a body will fall in any given time, 
therefore, the case is much the same as with our airplane. The 
velocity, or speed, is increasing with every passing moment, and there- 
fore the distance traveled in each succeeding second will be greater 
than the distance traveled in the previous second. 

If we assume that the value of g in the equation is already known
to be 32, the equation

$$V = gt$$

can then be written

$$V = 32t$$

We can then estimate the distance traversed by a falling body in 
each successive second by a process of approximation like this: 

Let us figure that the average speed for each 2 seconds is the same 
as at the midpoint (which may not be exactly right) and then let us 
estimate the distance traversed in those 2 seconds by multiplying this 
average speed by the time. Then by adding all the distances together 
we can get an approximation of the total distance. 



First we need to calculate the average speed for each period, using
the last equation, V = 32t:

End of 1st second, speed = 32(1) = 32 = average speed for 1st two seconds
End of 3d second, speed = 32(3) = 96 = average speed for 2d two seconds
End of 5th second, speed = 32(5) = 160 = average speed for 3d two seconds
End of 7th second, speed = 32(7) = 224 = average speed for 4th two seconds
End of 9th second, speed = 32(9) = 288 = average speed for 5th two seconds

Then we can estimate the distance traveled in each 2-second period,
as follows:

Period    Average speed,     Distance in that
          feet per second    period, feet

1st        32                   64
2d         96                  192
3d        160                  320
4th       224                  448
5th       288                  576

Estimated total distance     1,600

Another estimate could be obtained by estimating the distance for
each second separately, for there might be less error in assuming that
the speed at the middle of each second would represent the average for
that second. On this basis the problem would work out as follows:

Speed at middle of 1st second = 32(1/2) = 16; distance in that second = 16
Speed at middle of 2d second = 32(1 1/2) = 48; distance in that second = 48
Speed at middle of 3d second = 32(2 1/2) = 80; distance in that second = 80
Speed at middle of 4th second = 32(3 1/2) = 112; distance in that second = 112
Speed at middle of 5th second = 32(4 1/2) = 144; distance in that second = 144
Speed at middle of 6th second = 32(5 1/2) = 176; distance in that second = 176
Speed at middle of 7th second = 32(6 1/2) = 208; distance in that second = 208
Speed at middle of 8th second = 32(7 1/2) = 240; distance in that second = 240
Speed at middle of 9th second = 32(8 1/2) = 272; distance in that second = 272
Speed at middle of 10th second = 32(9 1/2) = 304; distance in that second = 304

In 10 seconds, total distance traversed = 1,600

This comes out exactly the same as before. On reflection, it is
evident that this is to be expected. Since the velocity increases at
a uniform rate for each moment of time, the true average rate of speed
for any period will be just halfway between the speed at the
beginning and at the end.* If we consider our 10 seconds as a whole,
the velocity at the beginning is equal to

$$V = 32t = 32(0) = 0$$

that is, the initial velocity is zero; whereas the velocity at the end is

$$V = 32t = 32(10) = 320$$

The average speed for the period, therefore, is

$$\frac{0 + 320}{2} = 160$$

which is exactly the same as the speed at which the body is falling at
the middle of the period, at the end of the fifth second, which is

$$V = 32t = 32(5) = 160$$

Computing the total distance traversed by multiplying the total 
time by this average speed, we have 

d = (160)(10) = 1,600 

giving exactly the same answer as our earlier computation. 
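The three computations above can be repeated mechanically. The sketch below is illustrative only; the function names are hypothetical, and g is taken as 32, as assumed in the text. It sums midpoint speed times period length for 2-second and for 1-second periods, and compares both with average speed times total time.

```python
G = 32  # feet per second gained each second, the value assumed in the text


def velocity(t):
    """Velocity after t seconds of fall, V = gt."""
    return G * t


def distance_by_midpoints(total_time, step):
    """Add up (speed at the middle of each period) x (length of the period)."""
    periods = round(total_time / step)
    total = 0.0
    for i in range(periods):
        midpoint = (i + 0.5) * step
        total += velocity(midpoint) * step
    return total


def distance_by_average_speed(t):
    """Average of the initial and final speeds, multiplied by the elapsed time."""
    return (velocity(0) + velocity(t)) / 2 * t


if __name__ == "__main__":
    # Each of these evaluates to 1600.0 for a 10-second fall.
    print(distance_by_midpoints(10, 2))
    print(distance_by_midpoints(10, 1))
    print(distance_by_average_speed(10))
```

The agreement between the coarse and the fine subdivision is a consequence of the velocity rising at a uniform rate; with a changing rate of increase the two sums would generally differ.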

The average speed during any period of t seconds is therefore 32t/2.
The total distance traversed in the t seconds can therefore be
determined by multiplying the average speed, 32t/2, by the total number of
seconds, t. This gives

$$d = \frac{32t}{2}\,t$$

or

$$d = \frac{32}{2}\,t^2 = 16t^2$$

So far, we have assumed that we know the acceleration, or rate of
increase in velocity per second. Suppose instead we had not known it
to begin with. How could we have found it out?

If we had used the symbol g to represent this value, we could have
carried out all the previous calculations, except that we should have
used “g” where instead we have used “32.”

* This would not be true of all types of relations. If, for example, velocity
increased at a changing rate, the smaller the units taken the more accurate would
be the result.



Our last formula then would have been

$$d = \frac{gt}{2}\,t$$

or

$$d = \frac{g}{2}\,t^2$$

If we let g/2 = b, the equation then would read

$$d = bt^2$$

We could readily determine the value for b by observing the
distance a given body falls in 1 second, in 2 seconds, in 3 seconds, etc.,
and then working out the probable value for the constant, just as has
been done before.

After we had made measurements of several distances d in the
several periods t, we could determine b most readily for the
straight-line equation by using T for $t^2$. Then

$$d = bT$$

(which is the same form as Y = a + bX).

Since we may assume a = 0, it follows, from equation (10),

$$a = M_y - bM_x$$

that

$$0 = M_y - bM_x$$

Hence

$$bM_x = M_y$$

and

$$b = \frac{M_y}{M_x} = \frac{\Sigma Y}{\Sigma X}$$

or, in the terms of this particular example,

$$b = \frac{g}{2} = \frac{\Sigma d}{\Sigma (t^2)}$$

which gives a basis for determining g, the acceleration due to gravity in
feet per second per second, simply by making observations of the time
for bodies to fall varying distances.

Substituting an observation of 64 feet in 2 seconds in this equation
gives b = 64/4 = 16; hence g = 32.
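As a rough illustration of this procedure, the sketch below computes b as the ratio of summed distances to summed squared times, which is the ratio-of-sums rule obtained above from a = 0, and then doubles it to obtain g. The function name `estimate_b` and the four "measurements" are hypothetical; the distances were simply generated from d = 16t² so that the arithmetic can be followed.

```python
def estimate_b(times, distances):
    """Constant b in d = bT, with T = t**2, taken as sum(d) / sum(T).

    This is the ratio-of-sums rule that follows from a = M_y - b*M_x
    when a is assumed to be zero.
    """
    squared_times = [t ** 2 for t in times]
    return sum(distances) / sum(squared_times)


# The single observation used in the text: 64 feet in 2 seconds.
b_single = estimate_b([2], [64])

# Hypothetical measurements, generated from d = 16 t^2 for illustration only.
times = [1, 2, 3, 4]
distances = [16, 64, 144, 256]
b_several = estimate_b(times, distances)
g = 2 * b_several

if __name__ == "__main__":
    print(b_single, b_several, g)
```

With real measurements the individual ratios d/t² would not agree exactly, and pooling the observations in this way is what "working out the probable value for the constant" amounts to.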



In this case it should be noted that the formula

$$d = \frac{g}{2}\,t^2$$

is derived on the assumption that the attraction of gravity is a
constant, tending to increase velocity at a uniform rate per second, or
other unit of time. Only if this assumption is correct can the equation
be used. The equation is directly based upon this assumption; the
reasoning used in deriving the equation also serves to explain what
the constants obtained really represent. On the basis of this reasoning
the equation determined is not a mere empirical expression of the
relation between time falling and distance traversed. Instead, it is a
fundamental measurement of why that distance is what it is, and
relates it in a logical manner to the attraction of the earth.

Fig. 20. The trajectory of a projectile, illustrating the equation
$Y = a + bX + cX^2$.

Although it would be quite possible in this particular case to draw
a freehand curve expressing the relation between time and distance,
it would not be so satisfactory as the mathematical equation. The
curve would merely state what the relation was; the equation, in
addition, explains why it is, in the terms of a particular hypothesis.

The parabolic equation. Another physical case in which a definite
relationship may be established logically, and then measured
statistically, is the firing of a projectile from a gun.

Disregarding the resistance of the air, there are three elements
which will determine the height the projectile will have reached at any
given instant after it leaves the muzzle of the gun. The simplest
of these elements is the height of the muzzle of the gun itself,
represented by a in Figure 20. All the subsequent changes in elevation will
obviously have to be added to that.

The second element is the rate at which the projectile is moving
upward at the instant it leaves the muzzle. That is dependent, of course,
on the angle at which the gun is elevated and the muzzle velocity. If
the gun were elevated 1 per cent from the horizontal and the muzzle
velocity were 1,000 feet per second, the projectile would leave
the muzzle moving upward at the rate of 10 feet per second. If there
were no resistance of the air, and if there were no force of gravity
to pull the projectile off its course, its momentum would carry it on
in this direction to infinity, as illustrated by the straight line in the
picture. Here b represents the increase in elevation the projectile
would attain for each additional second of flight, and a + bt the
elevation it would attain if gravity did not influence it.

But gravity is at work too. As we have already seen, as soon as a 
body is released, the pull of gravity tends to move it downward at 
ever-increasing speed. Even if it is headed upward as when shot 
from a gun, the pull of gravity starts tending to pull it down. The 
diagram illustrates what happens, with C used to represent the distance 
the body would have fallen if it had no upward velocity. At first the 
gain in height from its upward momentum is more than enough to 
offset the tendency to lose height because of the pull of gravity, and 
the projectile moves upward along the curved course indicated. But 
finally the loss due to gravity becomes greater than the gain from its 
original upward momentum and the trajectory gradually turns down- 
ward, until the projectile finally comes to rest in the earth or on its 
target. 

The height that the projectile reaches at any moment is the sum 
of these three components — the original height, the upward course, and 
the loss by gravity. Its height, then, can be expressed by adding 
together the three elements. 

a remains the same, regardless of the time elapsed. 

B, the height due solely to the original momentum, depends on 
the time, increasing as the time increases. If we let b represent the 
initial rate of gain in elevation per second of time, B can then be 
stated: 

B = bt 


Finally, C depends on the time elapsed, and, as we have just seen,
varies with the square of time. With the same notation as in our
falling-stone problem, but with C substituted for distance fallen,

$$C = ct^2$$

where c is negative, since this component is a loss of height.

Adding these three elements together, we obtain the equation for
the height of the projectile at any instant, letting H represent height
in feet:

$$H = a + bt + ct^2$$

It will be seen that this equation is exactly identical in form with
the equation for a parabola,

$$Y = a + bX + cX^2$$

Measurements of the height of the projectile at various given times
after firing the charge, made for a given gun, firing the same charge
at the same elevation of the gun, would give a series of X and Y values
which could be used in computing the constants a, b, and c, even if all
were unknown to start with.
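How the three constants might be recovered from such measurements can be sketched as follows. This is an illustration under stated assumptions, not the book's own computing routine: the constants are found by ordinary least squares through the three normal equations, the helper names are hypothetical, and the "measurements" are generated from a = 5 (an assumed muzzle height in feet), b = 10 (the upward rate used in the text's example), and c = -16 (half of g = 32, with the sign of a loss).

```python
def solve_3x3(matrix, rhs):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    rows = [row[:] + [value] for row, value in zip(matrix, rhs)]
    n = 3
    for col in range(n):
        # Bring the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]


def fit_parabola(times, heights):
    """Least-squares constants a, b, c in H = a + b*t + c*t**2."""
    def s(k):
        return sum(t ** k for t in times)

    def sh(k):
        return sum(h * t ** k for t, h in zip(times, heights))

    normal_matrix = [
        [s(0), s(1), s(2)],
        [s(1), s(2), s(3)],
        [s(2), s(3), s(4)],
    ]
    return solve_3x3(normal_matrix, [sh(0), sh(1), sh(2)])


# Hypothetical measurements generated from known constants, for illustration only.
times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
heights = [5 + 10 * t - 16 * t ** 2 for t in times]
a, b, c = fit_parabola(times, heights)

if __name__ == "__main__":
    print(round(a, 6), round(b, 6), round(c, 6))
```

Because the illustrative heights lie exactly on a parabola, the fit returns the constants used to generate them; actual measurements would scatter about the curve and the fitted constants would be estimates.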

If the equation were actually worked out, it would tell much more 
than merely the graph of the relation. For if the reasoning on which 
the several different constants were included in the equation was 
correct, then the equation would furnish a real explanation of why the 
projectile moved as it did, in terms of the laws of motion and of 
gravity upon which all such movements depend. 

Reasoning such as this, carried out to much greater lengths, has
formed the basis for the scientific “laws” which have been discovered
in physics and chemistry and expressed in definite equations. The
methods for determining the constants in such equations, as presented
earlier in the chapter, were devised to serve in determining such types
of relations. But when the same methods are applied to biological,
economic, educational, or other relationships in the natural or social
sciences, their value is much more limited. Only rarely is there real
basis for expecting a particular mathematical relationship such as
can be expressed in a given type of equation. In many cases our
knowledge of the reasons for the relationship is altogether too
limited to enable us to say why the relationship is; and even where
we can establish the reasons, they are frequently too complicated or
too involved (or even too biological) to admit of mathematical
treatment. If we express a given relation by a formula, merely on the
basis that that formula seems to describe the observed relation
satisfactorily, we do not have any greater knowledge of the relation than
if we merely drew in a freehand curve. The equation is simply an
empirical description of the relation; of and by itself, it offers no
clue as to what the relation means.

When to fit a mathematical equation. From this discussion, the
following tentative conclusion may be reached: Only when there is
some good logical basis for expecting a certain type of relation to hold
should mathematical curves be employed in describing the
relationship. When there is a logical basis for using a given formula, the
constants of the equation serve as an explanation of the real nature
of the relationship. In all other cases the mathematical curve has
no more significance than the freehand curve; the latter may
therefore be employed to describe the nature of the relation, and can be
determined with much less expenditure of effort. That does not mean
that a mathematical curve, based on adequate logical analysis, is of no
additional value. If it can be shown that such a curve does fit the
data, that may verify an hypothesis and so provide a “law” to state
the nature of the relationship, which may be of far more value than
the mere empirical statement of what the relationship is observed
to be. If, however, there is no logical basis for anything except the
empirical statement of the observed relation, the freehand curve
is just as valuable as one fitted by aid of a mathematical equation.

Where the logical expectations do not lead to a relation which can 
be formally expressed in a simple equation, they may, as has already 
been shown, still be sufficient to state a set of limiting conditions 
to be used in fitting a freehand curve. 

A mathematical equation used in an economic problem. Econo- 
mists sometimes use the hypothesis that for any one commodity there 
will tend to be a constant relation between the rate of change in the 
quantity consumers would buy and the rate of change in price. That 
is, if an increase of, say, 1 per cent in price would cause a 2 per cent de- 
crease in consumption when prices were low, a similar increase of 1 per 
cent in price would still cause a decrease of 2 per cent in consumption 
even when prices were high and consumption was already low. 

This economic hypothesis can be stated in definite mathematical 
terms quite as readily as the various physical hypotheses which have 
been mentioned; for it makes certain definite assumptions as to the 
precise way the two variables (price and consumption) are related. 

If C is used for quantity consumed and P for price, the statement
says that the relation

$$C = f(P)$$

that is, that the quantity consumed depends upon and varies with
price, is a function of the type

$$C = kP^b$$

The reason for its being that type can be seen by stating the last
equation in logarithmic form:

$$\log C = a + b \log P$$

where a = log k.

This says now that a given change in the logarithm of P is always 
accompanied by a change of b times as much in the logarithm of C. 
Remembering that the same absolute change in the logarithm of a 
number always means a constant percentage change in its actual 
value, we can see that this equation states the economic hypothesis 
that a given proportional change in price is always accompanied, on 
the average, by a constant proportional change in consumption, no 
matter whether price was high or low to start with. 

The practical application of the logarithmic demand equation may
be illustrated by a concrete case. Table 27 shows the slaughter of
hogs (under federal inspection) in the United States during the years
1922 to 1927 and the average price paid by packers during those years.
If we assume that all the meat and other products from these hogs were
consumed and ignore any possible shifts in the levels of demand during
that period, we may ask whether the relation between the annual
average price and the consumption of hog products in the United States
during this period agrees with the hypothesis that a given
proportional fall in price causes a constant proportional rise in consumption.
We may at least roughly hold constant the effect of changes in price
level by adjusting the price averages for concurrent changes in the level
of wholesale prices.

TABLE 27

Slaughter of Hogs, and Average Price, and Computation of
Logarithmic Curve
(log C = a + b log P)

           Weight of hogs     Price of      Logarithms of data     Extensions
Year*      slaughtered† (C)   hogs‡ (P)     Slaughter   Price
           Billion pounds     Dollars       C           P          CP         P^2
                              per cwt.

1922-23    11.66               7.62         1.0667      0.8820     0.94083    0.77792
1923-24    11.83               7.61         1.0730      0.8814     0.94574    0.77687
1924-25    10.25              10.71         1.0107      1.0298     1.04082    1.06049
1925-26     9.66              12.16         0.9850      1.0849     1.06863    1.17701
1926-27    10.04              10.84         1.0017      1.0350     1.03676    1.07123
1927-28    10.99               9.20         1.0410      0.9638     1.00332    0.92891

Sums                                        6.1781      5.8769     6.03610    5.79243

* From November to October, inclusive.
† Live weight of hogs slaughtered under federal inspection.
‡ Average costs to packers, at live weight. Adjusted for differences in price level, to 1928 level.

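Accordingly we “fit” the equation

$$\log C = a + b \log P$$

(where C = consumption, and P = price)

to the data by the methods previously discussed. The actual
computations are all shown in Table 27. With C and P standing for the
logarithms tabulated there,

$$M_c = \frac{\Sigma(C)}{n} = \frac{6.1781}{6} = 1.02968$$

$$M_p = \frac{\Sigma(P)}{n} = \frac{5.8769}{6} = 0.97948$$

$$\Sigma(cp) = \Sigma(CP) - nM_cM_p = 6.03610 - 6(1.02968)(0.97948) = -0.01521$$

$$\Sigma(p^2) = \Sigma(P^2) - nM_p^2 = 5.79243 - 6(0.97948)^2 = 0.03614$$

$$b_{cp} = \frac{\Sigma(cp)}{\Sigma(p^2)} = \frac{-0.01521}{0.03614} = -0.42086$$

$$a_{cp} = M_c - b_{cp}M_p = 1.02968 - (-0.42086)(0.97948) = 1.4419$$

$$C = a_{cp} + b_{cp}P = 1.4419 - 0.42086P$$

or, written in terms of the original variables,

$$\log C = 1.4419 - 0.4209 \log P$$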
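The computation can be retraced with a few lines of code. What follows is a sketch rather than the book's worksheet: the function names are hypothetical, the logarithms are the four-place values of Table 27, and the 1925-26 slaughter logarithm is entered as 0.9850, the value consistent with the column sum of 6.1781. Because the code carries the means at full precision instead of rounding them to five places, its constants may differ from the hand-computed ones in the third decimal.

```python
import math

# Four-place logarithms from Table 27 (1922-23 through 1927-28).
# The fourth slaughter value is taken as 0.9850, which agrees with the
# printed column sum 6.1781; this reading is an assumption of the sketch.
log_c = [1.0667, 1.0730, 1.0107, 0.9850, 1.0017, 1.0410]
log_p = [0.8820, 0.8814, 1.0298, 1.0849, 1.0350, 0.9638]


def fit_line(xs, ys):
    """Constants a and b of the straight line ys = a + b*xs by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    sum_xx = sum(x * x for x in xs) - n * mean_x ** 2
    b = sum_xy / sum_xx
    a = mean_y - b * mean_x
    return a, b


a, b = fit_line(log_p, log_c)


def estimated_consumption(price):
    """Consumption in billion pounds at a given price in dollars per cwt.

    Transforms the logarithmic estimate back to natural numbers:
    C = 10**(a + b*log10(P)).
    """
    return 10 ** (a + b * math.log10(price))


if __name__ == "__main__":
    print(round(a, 3), round(b, 3))
    # Ratio of estimated consumption after a 1 per cent rise in price.
    print(estimated_consumption(10.1) / estimated_consumption(10.0))
```

The last line shows the constant-proportion property directly: whatever starting price is chosen, raising it by 1 per cent multiplies the estimated consumption by the same factor, 1.01 raised to the power b.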


We may next test how well this equation describes the
relationship by plotting both the original observations and the curve
corresponding to the equation. Figure 21 shows this comparison in terms
of the logarithmic values used in the computation and with the
logarithmic values of the function (which, of course, is a straight
line). It is seen that this straight line seems to fit the original values
quite closely; they fall very close to it, above and below, in such a
random fashion that no other type of curve seems necessary.

The comparison may also be made in terms of the original values,
using the estimated values of the curve transformed from logarithms
back to real numbers. Figure 21 shows the comparison of these values.
Here again, the demand curve is seen to be a satisfactory “fit” to
the actual data.*

The economic hypothesis as to the relation between price and
consumption would therefore seem to be borne out so far as this
particular illustration is concerned, and with the assumptions stated.
The size of the constant b, -0.42, indicates that anywhere along the
curve a 1 per cent increase in the price of hogs is accompanied by
approximately 0.4 per cent decrease in hog consumption, or vice versa.†


[Figure 21: one panel plots the logarithm of consumption against the
logarithm of price; the other plots actual consumption, in billions of
pounds, against actual price, in cents per lb.]

Fig. 21. The relation of consumption of hog products to hog prices, fitted by a
logarithmic demand curve, both in logarithms of consumption and price and in
natural numbers.


* Six observations, such as used in this case, are far too few to give stable or
dependable results in price analysis or any other form of correlation. A curve
from a sample of six observations is still less reliable than is an average from a
sample of six observations. The close fit of the line to the observations in this
case is partly due to the small number of observations utilized. The student can
check this by recomputing this example including additional data for a longer
period, say through 1937-1938, as given in Agricultural Statistics, p. 327, U. S.
Department of Agriculture, 1939.

† In calculating this simple illustration, no attempt has been made to allow for
the effect of changes in other factors which might also influence hog prices, such as
the level of consumer buying power, the supplies or prices of other competing meat
animals, or the changes in export demand. Chapter 23 discusses actual price
analyses involving much more elaborate work than this shown here.

The wheat-protein example, on the other hand, illustrated a case
where there was no logical basis for the use of any particular equation
and where a freehand curve was therefore as satisfactory as any other
type and gave a better fit than any of the analytical types which
were tried. As has been stated, the great majority of the problems
in the natural and social sciences are probably of this type, where
the relation can be measured even though the specific causes for it
cannot be stated in mathematical language. Only where the relations
can be explained on some logical basis which lends itself to
mathematical statement is there justification for a large amount of work
to fit a specific formula; and even then, if it is found that that
particular formula does not give as good a “fit” as a simple freehand
curve, there would be question as to whether the hypothesis was in
agreement with the facts in that particular case.

Limitations in estimating one variable from known values of an- 
other. The methods shown so far provide a definite technique by which 
an investigator can determine the way in which the values of one 
variable differ as the values of another related variable differ. These 
same operations afford a basis for estimating values of the dependent 
variable from given values of the independent variable, for cases in 
addition to those from which the functional relation was determined. 
Whether such estimated values, for cases not included in the original 
study, can be expected to agree with the true values if they could be 
determined, depends upon two groups of considerations: (a) the de- 
scriptive significance of the curve and (b) its representative significance 
when it comes to applying to new observations. 

These two groups of considerations apply (a) to exactly what a
given curve means, with regard solely to the particular cases from
which it was determined; and (b) to the significance of the curve with
regard both to the ability of those observations to represent the
universe (whole group of facts) from which they were drawn and to the
ability of the curve to represent the true relations existing in that
universe. This second group involves an extension of the points which
were raised in the first chapter as to the reliability of an average;
discussion of these questions will be deferred to Chapters 18 and 19.

Just as an average computed from a sample may differ more or
less widely from the true average of the universe from which that
sample was drawn, so a regression line or curve determined from a
sample may differ more or less widely from the true regression in the
universe. The following chapter discusses this problem, and Chapter
18 presents methods of estimating how far the regression line or curve
from an individual sample may miss the true regression of the
universe.

The representative significance of a curve depends upon the
number of observations from which its shape was determined and how
closely the curve as determined “fits” those observations. Since the
number of observations usually differs along the different portions of
a curve, it may be much more reliable in its central portions, where
the bulk of observations occurs, than in the extreme portions where
the number of observations may be much less. This may be
especially marked in the case of complex curves fitted by mathematical
means, where single extreme observations may have a material effect
upon the shape of the end portions. In any event, only those
portions of the curve where there are enough observations to make its
shape and position definite should be regarded as statistically
determined; the end portions, when dependent upon a few observations,
should either not be used at all or else stated as very rough
indications of the true curve.

It is particularly to be noted that determination of the line or 
curve of relationship gives no basis for estimating beyond the limits 
of the values of the independent variable actually observed. No mat- 
ter whether a formula has been fitted or not, any attempt to make 
estimates beyond the range of the original data by “extrapolation,”
i.e., by extending the curve beyond the range of the observed data, 
gives a result that is not based on the statistical evidence. In case 
a formula has been used which has a good logical basis, extrapolation 
may give a result which it is logical to expect — but its reasonableness 
rests on the validity of the logic rather than on a statistical basis. 
The statistical analysis indicates only what the relations are within 
the range of the observations which are used in the analysis. 

The “closeness” with which the line or curve fits the original data
is another criterion of the reliance which can be placed in it. If the
data all fall quite close to the line, that fact inspires more confidence
in it than if they differ widely and erratically from it. But there are
special statistical measures of just what this “closeness” is, and they
will be given separate consideration in the next chapter.

As noted earlier, many more cases are required to determine a 
relation with any degree of dependability than were used in the 
hog-consumption example just considered. That example was given 
to illustrate the type of problem where a definite equation might be 
applied but not as an illustration of a real research problem. 

Summary. In some functional relations, the change in the de- 
pendent variable with changes in the independent variable cannot 
be represented by a straight line. Such a relation may be represented 
by a curve showing the value of the dependent variable for each par- 
ticular value of the independent variable. Curves may be fitted 
to given sets of observations either by use of mathematical functions, 
^uch as parabolas, logarithmic curves, and hyperbolas, or by various 


SUMMARY 


127 


processes of freehand smoothing. When there is some logical basis 
for the selection of a particular equation, the equation and the corre- 
sponding curve may provide a definite logical measurement of the 
nature of the relationship. When no such logical basis can be
developed, a curve fitted by a definite equation yields only an empirical
statement of the relationship and may fail to show the true relation. 
In such cases a curve fitted freehand by graphic methods, and conform- 
ing to logical limitations on its shape, may be even more valuable as 
a description of the facts of the relationship than a definite equation 
and corresponding curve selected empirically. 

In any event, estimates of the probable value of the dependent 
variable cannot be made with any degree of accuracy for values of 
the independent variable beyond the limits of the cases observed; and
can be made most accurately only within the range where a consider- 
able number of observations is available. It may be possible to 
extrapolate the curve if its equation is based on a logical analysis 
of the relation as well as on the cases observed; but in that case 
the logical analysis, and not the statistical examination, must bear the 
responsibility for the validity of the procedure. 

Note 1, Chapter 6. The methods described in this chapter have been illus- 
trated by determining the curve expressing the average change of percentage of 
protein with changes in percentage of vitreous kernels. In more general terms, that 
is, they have been limited to determining the relation 

$$Y = f(X)$$

Exactly the same methods can be used to determine the reverse regression, which 
would show the average change in percentage of vitreous kernels with a given 
change in percentage of protein. Although this regression is not precisely the recip- 
rocal of the other, it will usually be found that, where a curve rather than a straight 
line is necessary to represent one regression, a curve will similarly be needed for 
the other regression. It will not necessarily be a curve of the same shape,
however, or one that can be represented by the same equation.

Note 2, Chapter 6. When an equation is used with the dependent variable 
stated as a logarithm, as types (b) and (c) on page 93, the further assumption is 
involved that the errors to be minimized vary proportionately with the size of 
the dependent variable. The standard error of estimate also must be stated as a 
percentage of the value estimated, rather than as a natural number. For an
example of a problem where the range of error increases with the size of the
dependent variable, and where a logarithmic equation would therefore be justified,
see Figure 23, on page 164.



CHAPTER 7 


MEASURING ACCURACY OF ESTIMATE AND DEGREE OF 

CORRELATION 

The methods developed up to this point may be used to estimate
the values of one variable when the values of another are known or 
given. They also furnish an explicit statement of the average dif- 
ference or change in the values of the estimated or dependent variable 
for each particular difference or change in the value of the known or 
independent variable. But that is not enough. In addition it is fre- 
quently desirable to answer three queries: (1) How close can values 
of the dependent variable be estimated from the values of the inde- 
pendent variable? (2) How important is the relation of the dependent 
variable to the independent variable? (3) How far are the regres- 
sion curve and these relations, as shown by the particular sample, 
likely to depart from the true values for the universe from which the 
sample was drawn? Special statistical devices, termed (1) the stand- 
ard error of estimate and (2) the coefficient and index of correlation, 
have been developed to meet the need indicated by the first two 
questions. Error formulas and knowledge of the distributions of 
these coefficients, and standard errors for the regression line or curve, 
provide approximate answers for the third, under the assumption that 
the conditions of sampling are ideal (an assumption rarely valid 
even in experimental work). 

The Closeness of Estimate — Standard Error of Estimate 

Attention has previously been called to the fact that when some 
dependent variable, such as the distance required for an automobile to 
stop after the brake is applied or the protein content in wheat samples, 
is estimated from another variable, such as the speed at which the car 
is moving or the proportion of vitreous kernels in the sample, the 
estimated values in many cases will not be the same as the values 
of the dependent variable that were originally observed. These dif- 
ferences are obviously due to residual causes; that is, to variations 
in the dependent variable which were unrelated to changes in the
particular independent variable used in the analysis. For that reason
the differences between the estimated values and the actual values are
termed residual differences or, more simply, residuals.

For linear relations. The meaning of the residuals and their use 
in determining the standard error of estimate and the coefficient and 
index of correlation can best be understood if illustrated by a concrete 
case. Such an illustration is given in Table 28. Here 18 observations 
of the number of days (X) that horses worked on different farms
and the quantity of grain fed each horse (Y) have been fitted by a
straight line to estimate the quantity of feed from the days of work.
The estimated quantities, Y', and the residuals, z, or differences
between the estimate and the actual, are also shown.

TABLE 28

Days Worked by Horses, Grain Fed per Horse, and Grain Estimated from
Days of Work

Days worked    Grain fed, in      Estimated       Excess of actual
               hundredweight      grain fed*      over estimate
     X               Y                Y'                 z

    107              49              48.0               1.0
     70              28              40.9             -12.9
     81              44              43.0               1.0
     57              36              38.4              -2.4
     87              58              44.2              13.8
    114              38              49.4             -11.4
     73              49              41.5               7.5
     74              53              41.7              11.3
     42              33              35.5              -2.5
     90              45              44.8               0.2
    100              59              46.7              12.3
     59              39              38.8               0.2
     86              38              44.0              -6.0
     89              41              44.6              -3.6
     98              42              46.3              -4.3
     95              45              45.7              -0.7
     76              39              42.1              -3.1
     98              46              46.3              -0.3

* Computed by regression formula Y' = 27.43 + 0.1927X.
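The estimates and residuals for the horse-feeding example can be recomputed from the footnoted regression formula. The following is a minimal sketch, not part of the original text: the names `days`, `grain`, `estimate`, `estimates`, and `residuals` are illustrative assumptions, and the results are rounded to one decimal place as in the table.

```python
# Days worked (X) and grain fed in hundredweight (Y) for the 18 farms.
days = [107, 70, 81, 57, 87, 114, 73, 74, 42,
        90, 100, 59, 86, 89, 98, 95, 76, 98]
grain = [49, 28, 44, 36, 58, 38, 49, 53, 33,
         45, 59, 39, 38, 41, 42, 45, 39, 46]

A, B = 27.43, 0.1927  # constants of the regression formula in the footnote


def estimate(x):
    """Estimated grain fed, Y', for x days worked."""
    return A + B * x


# Y' rounded to one decimal, then z = Y - Y' (excess of actual over estimate).
estimates = [round(estimate(x), 1) for x in days]
residuals = [round(y - e, 1) for y, e in zip(grain, estimates)]

for x, y, e, z in zip(days, grain, estimates, residuals):
    print(f"{x:5d} {y:5d} {e:7.1f} {z:+7.1f}")
```

Each printed row gives X, Y, Y', and z in that order, so the output can be checked against the table line by line.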


The residuals vary from +13.8 to -12.9. If we wish to say how
large they are on the average, we can ignore the plus and minus signs
and compute the average deviation. For the 18 residuals in Table
28, the average deviation is 5.25, and the standard deviation is 7.13.
If these residuals are grouped in a frequency distribution, they fall
as shown in Table 29.
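The average deviation, the standard deviation, and the grouping into classes can be sketched as follows. This is an illustration under stated assumptions: the residuals are the one-decimal values recomputed from the regression formula Y' = 27.43 + 0.1927X, and the variable names are hypothetical. Because the residuals are rounded, the computed standard deviation may differ from the 7.13 of the text in the last digit.

```python
import math

# Residuals z (actual minus estimated grain fed) for the 18 farms.
residuals = [1.0, -12.9, 1.0, -2.4, 13.8, -11.4, 7.5, 11.3, -2.5,
             0.2, 12.3, 0.2, -6.0, -3.6, -4.3, -0.7, -3.1, -0.3]
n = len(residuals)

# Average deviation: mean size of the residuals with signs ignored.
average_deviation = sum(abs(z) for z in residuals) / n

# Standard deviation of the residuals around their own mean (divisor n).
mean_z = sum(residuals) / n
std_deviation = math.sqrt(sum((z - mean_z) ** 2 for z in residuals) / n)

# Frequency distribution in classes 4 units wide; each class includes its
# lower limit and excludes its upper limit.
counts = {}
for z in residuals:
    lower = 4 * math.floor(z / 4)
    counts[lower] = counts.get(lower, 0) + 1

print(f"average deviation  = {average_deviation:.2f}")
print(f"standard deviation = {std_deviation:.2f}")
for lower in sorted(counts):
    print(f"{lower:+4d} to {lower + 4:+4d}: {counts[lower]}")
```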

TABLE 29

Frequency Distribution of Residuals in Estimating Grain Fed

Residual*       Number of times       Residual*       Number of times
                   occurring                             occurring

-16 to -12             1                0 to  +4             4
-12 to  -8             1               +4 to  +8             1
 -8 to  -4             2               +8 to +12             1
 -4 to   0             6              +12 to +16             2

* As stated in Chapter 1, -16 to -12 means from -16 up to, but not including, -12; and so
on for the other groups.

The standard deviation of z is different from the standard deviations
previously computed. Instead of showing the standard deviation of
grain fed from the mean quantity (that is, $\sigma_y$), it shows the standard
deviation around a changing quantity, depending on the number of
days worked. The $\sigma_z$ is thus the standard deviation around the fitted
line of relation, and may be indicated graphically on a correlation
chart as a certain area above and below the fitted line. (Note Figure
22, page 151 of Chapter 8.)

The standard deviation is 7.13, so we should expect two-thirds of
the residuals to come between +7.13 and -7.13. Of the 18 cases,
12 came within this range of the line, or 67 per cent of all the cases.
Similarly, only 5 per cent of the cases would be expected to fall
outside the range $\pm 2\sigma$, or below -14.3 or above +14.3. Actually
none come outside this range, which is close to the expected
proportion for a normal distribution with this limited number of observations.
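The count of residuals falling within one and two standard deviations of the line can be checked directly. The sketch below is illustrative only: `share_within` is a hypothetical helper, and the residuals are the one-decimal values recomputed from the regression formula.

```python
# Residuals z for the 18 farms and their standard deviation from the text.
residuals = [1.0, -12.9, 1.0, -2.4, 13.8, -11.4, 7.5, 11.3, -2.5,
             0.2, 12.3, 0.2, -6.0, -3.6, -4.3, -0.7, -3.1, -0.3]
sigma = 7.13


def share_within(values, limit):
    """Fraction of values whose absolute size does not exceed limit."""
    return sum(1 for v in values if abs(v) <= limit) / len(values)


within_one = share_within(residuals, sigma)       # inside +/- 1 sigma
within_two = share_within(residuals, 2 * sigma)   # inside +/- 2 sigma

print(f"within 1 sigma: {within_one:.0%}")
print(f"within 2 sigma: {within_two:.0%}")
```

With only 18 cases the proportions can move in steps of about 5.6 per cent, which is why agreement with the normal-curve expectations can only be rough.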

Where the same set of conditions prevails as those under which 
the original data were selected and only the independent variable is 
known, it may be desired to estimate the probable value of the de- 
pendent variable from the known value of the independent. Thus if 
the number of days that horses work on other farms in the same area 
is known, it may be desired to estimate the quantity of grain that 
will be needed to feed them. Or in a case where yield of cotton
with various applications of irrigation water has been determined
(note the example in the next chapter) it may be desired to estimate
the most probable yield on other fields, solely from the amount of
water applied. In case the estimates were to be made for new
observations taken from the same “universe” (for example, on the same soil
type, in the same area, and for the same year) as were the previous
samples, a knowledge of the standard deviation of the residuals for 
original samples gives a basis for judging how closely the new esti- 
mates are likely to approximate the true, but unknown, yields for the 
new observations. Similarly in the feeding case it is evident that the 
errors of estimate will not often be greater than 14.3 hundred weight 
of grain, and usually will be less than 7.1 hundred weight. 

Since the standard deviation of the residuals does thus serve to 
indicate the closeness with which new estimated values may be ex- 
pected to approximate the true but unknown values, it has been 
named the standard error of estimate.¹

The symbol S is used to denote the standard error of estimate.
$S_{y.x}$ indicates the standard error for estimates of Y made from a linear
relation to X, by the equation Y = a + bX. Similarly, $S_{y.f(x)}$ would
indicate the standard error for estimates of Y made on the basis of a
freehand curve relation to X, as indicated by the equation Y = f(X).

The standard error of estimate is therefore defined by the two 
equations: 


$$S_{y.x}^2 = \sigma_z^2, \qquad S_{y.f(x)}^2 = \sigma_{z''}^2 \tag{20.1}$$


The standard error of estimate in estimating grain fed the horses 
from number of days worked, by the linear equation, is therefore 7.13 
hundred weight. 

For curvilinear relations. The calculation of the standard error
where a curvilinear function is used to express the relation may also
be illustrated by the horse-feeding data. From a freehand curve,
fitted by methods already described, estimates of Y from the relation
Y = f(X) were obtained, as shown in Table 30.

The standard deviation of the new residuals is 6.85. This is then the 
standard error of estimate for estimates based on the curve. 

The standard error of estimate of 6.85 from the curve, compared
with that of 7.13 from the straight line, indicates that in both cases the
amount of feed fed horses in a year can be estimated, for the cases
included in the sample, from the number of days they work in a year
with a standard error of between 675 and 725 pounds. It appears
at this stage that the estimates made on the basis of the curvilinear
relation are a little more reliable than those based on the linear
relation.

¹ Chapter 19 gives more refined measures of the accuracy with which estimates
may be made for new observations.


TABLE 30

Days Worked by Horses, Grain Fed per Horse, and Grain Estimated from
Days of Work, by Freehand Curve

Days worked    Grain fed, in      Estimated       Excess of actual
               hundredweight      grain fed       over estimate
     X               Y                Y''                z''

    107              49              46.5               2.5
     70              28              41.4             -13.4
     81              44              44.2              -0.2
     57              36              37.4              -1.4
     87              58              45.5              12.5
    114              38              46.5              -8.5
     73              49              42.2               6.8
     74              53              42.5              10.5
     42              33              32.5               0.5
     90              45              45.9              -0.9
    100              59              46.5              12.5
     59              39              38.1               0.9
     86              38              45.2              -7.2
     89              41              45.8              -4.8
     98              42              46.5              -4.5
     95              45              46.4              -1.4
     76              39              43.0              -4.0
     98              46              46.5              -0.5

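The standard error of estimate for the freehand curve can be verified from the Y and Y'' columns of Table 30. A minimal sketch, with illustrative variable names; `statistics.pstdev` is the standard-library population standard deviation (divisor n), which is assumed here to match the definition used in the text.

```python
import statistics

# Actual grain fed (Y) and the freehand-curve estimates (Y'') from Table 30.
grain = [49, 28, 44, 36, 58, 38, 49, 53, 33,
         45, 59, 39, 38, 41, 42, 45, 39, 46]
curve_estimates = [46.5, 41.4, 44.2, 37.4, 45.5, 46.5, 42.2, 42.5, 32.5,
                   45.9, 46.5, 38.1, 45.2, 45.8, 46.5, 46.4, 43.0, 46.5]

# Residuals z'' = actual minus curve estimate.
curve_residuals = [y - e for y, e in zip(grain, curve_estimates)]

# Standard error of estimate = standard deviation of the residuals.
s_curve = statistics.pstdev(curve_residuals)
print(f"standard error of estimate from the curve = {s_curve:.2f}")
```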

The standard error of estimate can also be used to indicate the
probable reliability of a series of estimates of the values of the dependent
variable for new observations when only the values of the
independent variable are known, but only where it is definitely known that
the new cases are drawn at random from exactly the same universe
(the same set of conditions) as were the observations from which
the relation was determined. In case they do not represent exactly the
same conditions (as if, for example, they represent a different period
of time²), then the standard error of estimate has meaning only with
respect to the scatter of the residuals around the regression line for the
cases used in determining the relationship. It measures (when
adjusted) what the differences probably would have been in the universe
from which the observations came but does not give more than a clue
or a possible indication as to what the differences may be when the
same relations are applied to data from new or different conditions.

Adjustment of standard error of estimate for the number of ob- 
servations. The standard deviations of a series of samples drawn from 
any stable universe will vary from one to another, owing to statistical 
fluctuations. The same is true for the standard error of estimate 
computed for a fitted line. The standard deviations, or standard errors 
of estimate, not only vary but on the average also are somewhat 
smaller than the result that would be obtained from a large sample 
from the same universe. Because of this tendency of the standard 
error of estimate from the sample to understate the standard error in 
the universe, an adjustment is necessary. An unbiased estimate of the 
value of the standard error of estimate for the entire universe may be 
calculated from the standard error of estimate for the sample by the 
use of the following equations: 


$$\bar{S}_{y.x}^2 = \frac{n\sigma_z^2}{n-2} = \frac{nS_{y.x}^2}{n-2} \tag{21.1}$$

hence

$$\bar{S}_{y.x} = S_{y.x}\sqrt{\frac{n}{n-2}} \tag{21.2}$$

and for curvilinear functions

$$\bar{S}_{y.f(x)}^2 = \frac{n\sigma_{z''}^2}{n-m} = \frac{nS_{y.f(x)}^2}{n-m} \tag{22.1}$$

hence

$$\bar{S}_{y.f(x)} = S_{y.f(x)}\sqrt{\frac{n}{n-m}} \tag{22.2}$$

In these equations, $\bar{S}$ is used to indicate the estimated standard
error of estimate for the universe, just as $\bar{\sigma}$ was used (in Chapter 2)
to indicate the estimated standard deviation in the universe from
which the sample was drawn.

² See Chapter 2, page 15, for the other conditions assumed before error formulas
apply exactly.

In equations (21.1) to (22.2), n stands for the number of observa- 
tions. In equations (22.1) and (22.2), m stands for the number 
of constants in the regression equation, such as a, b, and c. In the
case of a parabola of the second order (type a), m would be 3; for
a cubic parabola (type f), it would be 4. Where a freehand curve has
been used, it is necessary to estimate how many constants would be 
needed to represent the curve mathematically. (See pages 76 to 81 
for the constants needed to represent various shapes of curves.) 

The standard error of estimate in estimating grain fed the horses 
by the linear equation, after the standard deviation of the residuals 
is adjusted by equation (21.1), works out to be: 


$$\bar{S}_{y.x}^2 = \frac{n\sigma_z^2}{n-2} = \frac{18(7.13^2)}{18-2} = 57.19$$

$$\bar{S}_{y.x} = 7.56$$


The new value indicates that the errors in estimating grain from days 
worked, when the estimate is made for new observations drawn at 
random from the same universe, will run slightly larger than was 
indicated by the residuals for the cases included in the study, as 
tabulated in Table 29. 

When the standard deviation for the curvilinear function is cal- 
culated by equation (22.1) , a different result from that before appears. 
If it is assumed that the regression curve used could have been
represented mathematically by an equation with three constants (such as
a parabola) then the correction works out to be:

$$\bar{S}_{y.f(x)}^2 = \frac{n\sigma_{z''}^2}{n-m} = \frac{18(6.85^2)}{18-3} = 56.31$$

$$\bar{S}_{y.f(x)} = 7.50$$

The adjusted standard error of estimate for the curvilinear rela- 
tion, 7.50, is barely smaller than that for the linear equation, 7.56. 
This indicates that when estimates are made for new observations 
from the same universe, the straight line is likely to give about as
reliable results as is the regression curve. Not unless the adjusted
standard error for the curve is materially smaller than for the straight
line can the curvilinear regression be expected to improve the accuracy
of estimate.³
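The adjustment for the number of observations and constants can be written as one small function. This is a sketch under stated assumptions: `adjusted_standard_error` is a hypothetical name, the formula is the square-root form of equations (21.1) and (22.1), and the inputs are the unadjusted standard errors of 7.13 and 6.85 quoted in the text.

```python
import math


def adjusted_standard_error(s, n, m=2):
    """Adjust a sample standard error of estimate s for n observations
    and m constants in the regression equation: s * sqrt(n / (n - m))."""
    if n <= m:
        raise ValueError("need more observations than constants")
    return s * math.sqrt(n / (n - m))


# Straight line: two constants (a and b).
linear = adjusted_standard_error(7.13, n=18, m=2)
# Freehand curve, assumed equivalent to a three-constant parabola.
curve = adjusted_standard_error(6.85, n=18, m=3)

print(f"adjusted, straight line: {linear:.2f}")
print(f"adjusted, curve:         {curve:.2f}")
```

The larger m is taken to be, the more the curve's apparent advantage shrinks, which is the point of the comparison above.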

Units of statement for standard error of estimate. The standard 
error of estimate is necessarily stated in exactly the same kind of 
units that the original dependent variable is stated in. Where the 
dependent variable is stated in feet, as in the automobile problem, 
the standard error of estimate will be in feet; where it is in percentage 
points, as in the wheat problem, the standard error will be in per- 
centage points; and where it is in logarithms, as in Table 27, the 
standard error will be in logarithms. Thus in a case like that shown 
in Table 27, the standard error might be the logarithm 0.038. That 
means that the logarithm of the estimates is likely to agree with the 
logarithm of the true values to within ±0.038, two-thirds of the time.
With an estimated logarithm of 1.00, the logarithm of the true value 
would then be between 0.962 and 1.038, two-thirds of the time. In 
terms of anti-logarithms, this gives values of 9.16 and 10.91, or 
between 9.1 per cent above and 8.4 per cent below the value 10. Since 
a given logarithmic difference always means the same percentage 
difference, no matter how large or how small the base to which it is 
applied, when the standard error is thus stated in logarithms it
indicates the range within which the estimates may be expected to be
reliable, not as absolute quantities such as pounds of grain but as 
percentages. In terms of absolute differences, the estimate might be 
expected to be right within 100 pounds, no matter whether the quantity 
fed was estimated at 1,000 pounds or 4,000 pounds; whereas using 
logarithms, if the estimate was expected to be right within 100 pounds 
for an estimate of 4,000 pounds, it would be expected to be right 
within 25 pounds for an estimate of 1,000 pounds. 
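The conversion of a standard error stated in common logarithms into percentage limits can be sketched as follows. The function name `percentage_limits` is an illustrative assumption; the figures 0.038 and 1.00 are those used in the text.

```python
def percentage_limits(log_se):
    """Return (per cent above, per cent below) implied by a standard
    error of estimate stated in common (base-10) logarithms."""
    above = (10 ** log_se - 1) * 100
    below = (1 - 10 ** (-log_se)) * 100
    return above, below


log_se = 0.038
estimated_log = 1.00

# Anti-logarithms of the limits around the estimated logarithm.
low_value = 10 ** (estimated_log - log_se)
high_value = 10 ** (estimated_log + log_se)
above, below = percentage_limits(log_se)

print(f"limits: {low_value:.2f} to {high_value:.2f}")
print(f"{above:.1f} per cent above, {below:.1f} per cent below")
```

The percentages depend only on the logarithmic standard error and not on the size of the estimate, so the same call applies whether the estimate is 1,000 or 4,000 pounds.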

The standard error of estimate is thus computed from the stand- 
ard deviation of the residuals for the cases on which the relation is 
based. It indicates the closeness with which values of the dependent 
variable may be estimated from values of the independent variable. 
Its exact interpretation differs with the particular units in which the 
values of the dependent variable are expressed. 

³ The values of $\bar{S}_{y.x}$ are subject to errors of sampling, just as the values of $\bar{\sigma}$ are
subject to errors of sampling. Accordingly, the values of $\bar{S}_{y.x}$ must be regarded
only as estimates of the true values which prevail in the universe from which
the sample is drawn. Also, it must be remembered that the adjustment, m, for the
number of degrees of freedom removed, is only an approximate adjustment in the
case of a freehand curve, and that this introduces a further limitation to the
accuracy of $\bar{S}_{y.f(x)}$.


The Relative Importance of the Relationship — Correlation 

In certain problems it might be found that every bit of variation in 
one variable could be explained, or accounted for, by associated dif- 
ferences in the value of an accompanying variable. Thus all the varia- 
tion in the volume of a cube can be explained by the corresponding 
difference in the length of one side. No other variable is needed to 
account for the volume of the cube. If we know what the length of 
the side is, we can compute accurately what the volume will be. All 
the variation in volume can therefore be said to be explained, or 
accounted for, by the known relation to the length of the side. 

In most problems with which the statistician has to deal, however, 
all the variation cannot be explained by the relation to another 
variable, and residual variation is left over. As has just been pointed 
out, this residual variation can be measured and used as an indica- 
tion as to the errors in estimate. 

It is obvious that if no relation has been found, the independent 
variable considered does not explain any of the observed variation in
the dependent variable, and so none of the variation can be explained 
as due to, or associated with, the independent variable. If, as in the 
case of the cube, the estimates all agree exactly with the actual values, 
there are no residual elements, and the variation is perfectly ex- 
plained. But between these two extremes lie the cases of partial 
explanation, where a portion of the variation can be explained by the 
independent variable considered, and a portion cannot. In the auto- 
mobile case, part of the variation in stopping distance, but not all, 
was associated with the speed; in the wheat case, part of the varia- 
tion in protein content, but not all, could be estimated from variations 
in the proportion of vitreous kernels; and in the horse-feed case, part 
of the variation in feed fed, but not all, could be accounted for by 
variations in number of days worked. In many problems it is of
interest to determine what proportion of the variation in the dependent
variable can be explained by the particular independent variable con- 
sidered, according to the relation observed. 

Measurement of the relative importance of the relation between 
two variables calls for a different type of statistical constant than the 
standard error of estimate. The standard error of estimate simply 
indicates the size of the residuals without regard to the amount of 
variation in the dependent variable as first observed. If the standard 
error of estimate for a cotton-yield problem, for example, were 50 
pounds, that would be the standard error no matter whether the
yield of cotton in the original cases varied only between 200 and 400
pounds or between 50 and 1,200. If the yields varied only between
200 and 400 pounds, and the standard error was 50, practically all
the variation in the original yields would still be left in the residuals;
whereas if the yields varied between 200 and 1,200 and the standard 
error was 50, only a very small portion of the original variation would 
be left in the residuals. Yet the standard error of estimate would 
be of the same size in both cases. 

What is needed to show the relative importance of the relationship 
is some measure which shows what proportion of the original varia- 
tion has been accounted for. The amount of the variation in the series 
of estimated (Y') values shows how much variation has been
accounted for. All that need be done is to compare that variation
with the variation in the original series to determine what propor- 
tion of the variation has been explained. 

The standard deviation may be employed for the purpose of
measuring the amount of variation. The actual values, Y, shown in
Table 28, have a standard deviation of 7.92. The values estimated
from the linear regression equation, Y', have a smaller standard
deviation, 3.47. If we determine how large the latter is compared to the
former, we get $\sigma_{y'}/\sigma_y = 3.47/7.92$, or 0.44. This is then a measure of
the importance of relationship between the two variables (or the
amount of correlation, as it is termed) according to the particular type
of curve for which the relationship was determined.
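The ratio of the two standard deviations can be recomputed from the data. The sketch below is an illustration only: the variable names are assumptions, the estimates come from the regression formula Y' = 27.43 + 0.1927X without intermediate rounding, and `statistics.pstdev` (divisor n) is assumed to match the standard deviation used in the text. Small differences in the last digit from the printed 3.47 are to be expected.

```python
import statistics

days = [107, 70, 81, 57, 87, 114, 73, 74, 42,
        90, 100, 59, 86, 89, 98, 95, 76, 98]
grain = [49, 28, 44, 36, 58, 38, 49, 53, 33,
         45, 59, 39, 38, 41, 42, 45, 39, 46]

A, B = 27.43, 0.1927  # constants of the linear regression formula

# Estimated values Y' from the straight line.
estimates = [A + B * x for x in days]

sd_actual = statistics.pstdev(grain)          # variation in Y
sd_estimated = statistics.pstdev(estimates)   # variation in Y'
ratio = sd_estimated / sd_actual              # share of variation accounted for

print(f"sd of Y  = {sd_actual:.2f}")
print(f"sd of Y' = {sd_estimated:.2f}")
print(f"ratio    = {ratio:.2f}")
```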

Linear relations: coefficient of correlation. Where the relationship
between the two variables is found or assumed to be a straight line,
the value of $\sigma_{y'}/\sigma_y$ is termed the coefficient of correlation. The symbol
r is used to represent it. When values of Y are estimated from values of
X according to a straight-line equation, then the proportion of the
variation in Y which is so accounted for is indicated by the notation
$r_{yx}$, which is read “the coefficient of correlation between Y and X.”

The coefficient of correlation may therefore be defined

$$r_{yx} = \frac{\sigma_{y'}}{\sigma_y} \tag{23.1}$$

This formula gives values of r identical with those given by the
more usual formula, equation (27), presented subsequently on page
148, as can be proved by simple algebra (see Note 3a, Appendix 2).

The method of computing the coefficient of correlation which has
just been shown demonstrates that the coefficient is simply a measure
of how large the variation in the estimated values is, in proportion to
the variation in the original values. The coefficient of correlation thus
measures the proportion of the variation in one variable which is
associated with another variable, and therefore is a measure of the
relative importance of the concomitance of variation in the two factors.

Curvilinear relations: index of correlation. In case the relation
has been determined as a curvilinear function instead of a straight
line, the ratio $\sigma_{y''}/\sigma_y$ is termed the index of correlation, and is
represented by the symbol $\rho_{yx}$.

The index of correlation may therefore be approximately defined as

$$\rho_{yx} = \frac{\sigma_{y''}}{\sigma_y} \tag{23.2}$$

(A more exact value for the index of correlation is given in equation
(29) on page 156.)

Computing the index of correlation for the horse-feed case,
$\sigma_{y''}/\sigma_y = 3.86/7.92 = 0.49$. From this figure, it would appear that the
correlation is definitely higher for the curve than for the straight line.⁴
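The approximate index of correlation of equation (23.2) can be checked from the Y and Y'' columns of Table 30. A minimal sketch with illustrative names; because the tabulated curve estimates are rounded to one decimal, the computed standard deviation of Y'' may differ slightly from the 3.86 of the text.

```python
import statistics

# Actual grain fed (Y) and freehand-curve estimates (Y'') from Table 30.
grain = [49, 28, 44, 36, 58, 38, 49, 53, 33,
         45, 59, 39, 38, 41, 42, 45, 39, 46]
curve_estimates = [46.5, 41.4, 44.2, 37.4, 45.5, 46.5, 42.2, 42.5, 32.5,
                   45.9, 46.5, 38.1, 45.2, 45.8, 46.5, 46.4, 43.0, 46.5]

sd_actual = statistics.pstdev(grain)               # variation in Y
sd_curve = statistics.pstdev(curve_estimates)      # variation in Y''
index_of_correlation = sd_curve / sd_actual        # approximate, eq. (23.2)

print(f"sd of Y''            = {sd_curve:.2f}")
print(f"index of correlation = {index_of_correlation:.2f}")
```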

Characteristics of the measures of correlation. It should be noted 
that in the case of straight-line relations, if the line has a positive slope, 
so that as X increases the values of Y' (the estimated values of Y) 
increase, the correlation is said to be positive, and a plus sign is affixed 
to the correlation coefficient. Similarly, if the line has a negative slope, 
so that as the values of X (the independent variable) are larger, the 
values of Y' (the estimated values for the dependent variable) become 
smaller, the correlation is said to be negative, and a minus sign is 
affixed to the correlation coefficient. The coefficient of correlation thus 
takes the same sign as the constant b of the corresponding linear
equation. In the case of the correlation index, the curve may be
positive in one portion and negative in another, so no sign is used, 
and reference to the curve is necessary to indicate the nature of the 
relationship. 

In a case where the observed relation explains all the variation in 
the dependent variable, the estimated values will be identical with the 
actual values. The standard deviation of Y' will therefore be exactly 
as large as the standard deviation of Y, and the ratio $\sigma_{y'}/\sigma_y$ will equal
1.0. This is termed perfect correlation, and is indicated when $\rho = 1.0$,
or when r = +1.0 or -1.0.


⁴ In some statistical texts, $r_{yx}$ is used to represent the correlation observed in a
given sample, and $\rho_{yx}$ is used to represent the true correlation existing in the
universe from which that sample was drawn. The student should not confuse that use
of the Greek rho, $\rho$, with the way it is used here.


At the other extreme of no relation, no variation can be accounted 
for by the particular independent variable considered, and the estimated 
values Y' are therefore all the same, being merely the average of Y. In
that case the standard deviation of the estimated values is zero, and the
ratio $\sigma_{y'}/\sigma_y = 0/\sigma_y = 0$. The case of complete absence of correlation,
therefore, is indicated by values of 0 for either r or $\rho$.

The possible values of the coefficient of correlation therefore range
from 0 to +1.0 or to -1.0; whereas the values for the index of
correlation range from 0 to 1.0. Since most problems with which the
investigator has to deal involve cases that are intermediate, where there
is some but not perfect correlation, it is these intermediate cases which
are of most importance. The precise significance of different values
of r and $\rho$ will next be considered.

Where both X and Y are assumed to be built up of simple elements
of equal variability, all of which are present in Y but some of which are
lacking in X, it can be proved mathematically that $r^2$ measures that
proportion of all the elements in Y which are also present in X. For
that reason in cases where the dependent variable is known to be
causally related to the independent variable, $r^2$ may be called the
coefficient of determination. It may be said to measure the percentage
to which the variance in Y is determined by X, since it measures that
proportion of all the elements of variance in Y which are also present
in X.⁵ The coefficient of determination, $d_{xy}$, may be defined by the
equation

$$d_{xy} = r_{xy}^2 \tag{24.1}$$

Where some elements are present in each variable which occur in the
other, the coefficient of determination is the product of these joint
proportions. That is, if 2/3 of the elements in X are the same as 2/3 of
the elements in Y, then the coefficient of determination will be equal
to 4/9.

Although the coefficient of correlation was the earliest measure used,
it can be seen that it may be misinterpreted. Thus if half the variance
in Y is directly due to X, the coefficient of correlation would be 0.707.
Yet the coefficient of alienation⁶ is also 0.707. If instead
the coefficient of determination is used, when we know that that is 0.50,
we know at once that the coefficient of non-determination⁶ is also
0.50; or if the determination is 0.60, the non-determination is 0.40.
The coefficient of non-determination may be defined

$$k_{xy}^2 = 1 - r_{xy}^2 \tag{24.2}$$

Since this is the most direct and unequivocal way of stating the
proportion of the variance in the dependent factor which is associated
with the independent factor, it may be used in preference to the other
methods.

⁵ See Note 4, Appendix 2.

⁶ See Note 5, Appendix 2, for a fuller definition of these new terms.

Where curvilinear relations have been used in determining the
relationship, the term index of determination will be used to denote the
value of $\rho^2$, thus retaining the same relation to the index of correlation
that the coefficient of determination bears to r, the coefficient of
correlation. The index of determination, $d_{y.f(x)}$, may be defined

$$d_{y.f(x)} = \rho_{yx}^2 \tag{24.3}$$

When an expression is used such as “Forty per cent of the variance
in yield is due to differences in rainfall,” it will be understood that it is
either the coefficient or the index of determination which is being
stated.
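The relation between the coefficient of correlation and the measures derived from it can be illustrated numerically. In the sketch below, `determination_measures` is a hypothetical helper, and the coefficient of alienation is taken as the square root of one minus the squared correlation, which is the usual definition and agrees with the 0.707 quoted in the text.

```python
import math


def determination_measures(r):
    """Return the measures derived from a correlation coefficient r."""
    determination = r ** 2
    return {
        "determination": determination,
        "non_determination": 1 - determination,
        "alienation": math.sqrt(1 - determination),
    }


# Half the variance in Y due to X: r is the square root of 0.50.
half = determination_measures(math.sqrt(0.50))
for name, value in half.items():
    print(f"{name:18s} {value:.3f}")

# The horse-feed example, with r = 0.44 from the straight line.
horses = determination_measures(0.44)
print(f"determination for r = 0.44: {horses['determination']:.2f}")
```

Stated as determination and non-determination, the two parts add to 1, which is not true of the coefficients of correlation and alienation.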

Relation of the measures of correlation to the two regression lines. 
Attention has been called in several previous chapters to the fact that 
two regression lines can be fitted to any set of observations. These 
are denoted by the two coefficients $b_{yx}$ and $b_{xy}$ in the two equations

$$Y = a_{yx} + b_{yx}X$$

and

$$X = a_{xy} + b_{xy}Y$$

Although there are these two regression lines, there is only a single
coefficient of correlation for any one set of observations. In fact, the
coefficient of correlation has certain definite relations to the two lines.
It indicates how closely the two lines approach one another. The
higher the correlation, the closer the two lines come together; the
lower the correlation, the farther they diverge. In perfect correlation
($r = \pm 1$) the two lines coincide. When there is no correlation
(r = 0) the two lines will be at right angles to one another.

This relationship is so exact that the value of the correlation
coefficient can be computed from the slopes of the two lines according to the
equation

$$r = \sqrt{b_{yx}\,b_{xy}} \tag{24.4}$$

It follows from this equation that when r = 1, $b_{yx} = 1/b_{xy}$, and therefore
the two regression lines will coincide.⁷

Although there can be only a single coefficient of correlation for a
single set of observations, there can be two indexes of correlation.
This follows from the fact that the curve which expresses the relation

$Y = f(X)$

may be a curve of quite a different type from that which expresses
the relation

$X = \phi(Y)$

Accordingly, the index of correlation, $\rho_{yx}$, which measures the
closeness of correlation according to the first curve, may be quite
different from the index of correlation, $\rho_{xy}$, which measures the
closeness according to the second curve. Only in the special case where
all the observations lie precisely along the curve, so that $\rho = 1$,
will the two indexes have the same value. In that case it will also hold
true that the curves $Y = f(X)$ and $X = \phi(Y)$ will be identical with
the coordinates reversed.

There is only one correlation coefficient, r, however. It measures
the correlation according to both regression lines. Since
$r = r_{yx} = r_{xy}$, either notation can be used interchangeably.

⁷ This property of the two lines can be used to estimate graphically the
closeness of correlation. When the two variables, X and Y, are stated in
terms of unit standard deviation, $X/\sigma_x$ and $Y/\sigma_y$, by
dividing each observation by the standard deviation of the series, the
coefficient of correlation will then be a precise mathematical function
of the angle between the two lines. By stating the variables in this
way, plotting them on a dot chart, and drawing in the two lines
graphically, a fairly close approximation to the coefficient can be
obtained.

Adjustments for number of observations. Where the number of
cases in the sample is not very large, both the coefficient and index
of correlation require certain adjustments before the values calculated
from the sample, as given by equations (23.1) and (23.2), can be
used to indicate the values which are most probably true for the
universe from which that sample was drawn. Without correction,
the observed coefficient or index of correlation tends to exceed the
true correlation.⁸

Denoting the adjusted constants as $\bar{r}_{yx}$ and $\bar{\rho}_{yx}$,
the adjustment formulas are:

$\bar{r}_{yx}^2 = 1 - (1 - r_{yx}^2)\left(\frac{n - 1}{n - 2}\right)$    (25)

$\bar{\rho}_{yx}^2 = 1 - (1 - \rho_{yx}^2)\left(\frac{n - 1}{n - m}\right)$    (26)

If the value to the right of the first "1 -" in equation (25) or (26)
exceeds unity, 0 must be taken for the value of $\bar{r}$ or $\bar{\rho}$.

In these equations, n and m have the same meaning as in equations
(22.1) and (22.2), presented on page 133. The adjusted value $\bar{r}$
is the value which most probably exists in the universe, if the
correlation is 0.80 or better. In half the samples, the value $\bar{r}$
will be as large as the true value; and in half, it will be smaller than
the true value. If, however, the correlation is low, 0.60 or less,
$\bar{r}$ is a somewhat more conservative estimate of the true
correlation.

Applying the correction to the value of $r_{yx}$ previously computed for
the horse problem, the correlation of grain fed with number of days
worked is found to be:

$\bar{r}_{yx}^2 = 1 - \frac{[1 - (0.44)^2](18 - 1)}{18 - 2} = 0.1432$

$\bar{r}_{yx} = 0.38$
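The adjustment in equations (25) and (26) can be sketched in a few lines of Python. The function name `adjusted_correlation` and its arguments are illustrative choices, not the author's notation; `m = 2` reproduces equation (25) and a larger `m` gives equation (26).

```python
import math

def adjusted_correlation(r, n, m=2):
    """Adjust an observed coefficient (m = 2) or index (m = number of
    constants in the curve) of correlation for the number of cases n."""
    adjusted_square = 1.0 - (1.0 - r * r) * (n - 1) / (n - m)
    # When the subtracted term exceeds unity, 0 is taken for the result.
    if adjusted_square <= 0.0:
        return 0.0
    return math.sqrt(adjusted_square)

# Horse problem: r = 0.44 observed from 18 cases.
r_bar = adjusted_correlation(0.44, n=18)
print(round(r_bar, 2))
```

With very few cases and a low observed correlation the adjusted value falls to zero, which is the rule stated after equation (26).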


The index of correlation is even more likely to be spuriously high
when based on a small number of cases than is the coefficient of
correlation and is even more in need of the adjustment indicated by
equation (26).

⁸ The value of r calculated from a sample is derived from the standard
deviation of the estimated values, $\sigma_{y'}$, and the standard
deviation of the dependent variable, $\sigma_y$. It was noted in
Chapter 2 that when standard deviations are computed from a small
sample, they tend to be less than the true standard deviation of the
universe, and this applies to $\sigma_y$. At the same time,
$\sigma_{y'}$ is determined from a limited number of observations. It
was already pointed out that a straight line would exactly fit any two
observations with no residuals at all. When a straight line is fitted to
ten observations, there are only eight "degrees of freedom" in
determining the values a and b, as the "freedom" of two of these
observations is used up in the determination. As a consequence of these
conditions, $\sigma_{y'}$ tends to be larger than it should be, and
$\sigma_y$ tends to be too small. Hence the quotient,
$\sigma_{y'}/\sigma_y$, tends to be too large, on the average. Also,
since $\sigma_{y'}$ tends to be too large, $\sigma_z$ tends to be too
small, and hence the observed standard error of estimate also needs
correction, as provided in equations (21.1) to (22.2).

Computing the index of correlation for the horse-feed problem, 
with the corrections shown in equation (26) : 



0.1389 


After adjusting, we find that in this case the index of correlation is
almost the same as the coefficient, agreeing with the conclusion shown
by the two standard errors of estimate. Just as with the standard
errors, so it is with the correlation: not unless the index of
correlation is still definitely higher than the coefficient, after they
have been adjusted by formulas (25) and (26), can it be said that there
is definite indication of curvilinear correlation rather than of
linear.¹¹

It should be noted that in any case the adjustment to r or p is 
small compared with its own standard error; that is, the value given
by the sample may miss the true value in the universe by a margin 
much larger than the difference between the observed value and the 
adjusted value. Chapter 18 discusses methods of estimating the 
probable range of such departures of the observed correlation from 
the true. Even so, the average value from a series of samples always 
tends to have the bias mentioned, and it is worth eliminating this 
average bias as far as possible, even if the adjusted value from an 
individual sample is still subject to a considerable standard error of 
its own. 

⁹ The adjusted index of correlation $\bar{\rho}$ has the same
interpretation as the adjusted coefficient of correlation: half of the
samples will give values of $\bar{\rho}$ which will not exceed the true
value of $\rho$ in the universe from which the sample was drawn.

¹⁰ Just as the a and b of the linear equation eliminate two degrees of
freedom, a curve representing three constants (or more) can be passed
exactly through three observations (or more) and so may eliminate three
(or more) degrees of freedom. There is therefore even more tendency for
$\rho$ to be spuriously high than for r, and the correction is even more
needed.

¹¹ See Figure F of Appendix 3 for a graphic method of computing adjusted
coefficients or indexes of correlation from the unadjusted values.

The reliability of the regression line or curve and of the measures
of correlation. Chapter 2 shows how a series of samples drawn from
the same universe would yield varying estimates of the true average
in that universe. It also presented methods of estimating how far the
average from a single sample might miss the true average in the
universe. In exactly the same way, if regression lines or curves are
determined for a series of samples from the same universe, they will
yield regressions which vary among themselves. Similarly, the coef- 
ficients or indexes of correlation and the standard errors of estimate 
will vary from sample to sample. Standard errors of each of these 
measures are available. They provide estimates of the range from 
the true values in the universe within which two-thirds of the values 
from such samples will fall and of the wider range within which larger 
proportions of the samples will fall. These measures of reliability for 
the sample results are much more complicated, both in computation 
and in interpretation, than the standard error of an average. Ac- 
cordingly, their presentation is deferred to a later chapter (Chapter 18) . 
In addition, the special problem of the reliability of an individual 
estimate for an individual new observation, from the results shown 
by a sample, is treated in a separate chapter (Chapter 19). The 
methods given in the present chapter and Chapter 8 are sufficient for 
determining the correlation and regression as shown in the individual 
sample. Before a student or research worker uses the results of the 
sample to draw more general conclusions as to the relations which hold 
true in other samples or in the universe as a whole, or before he makes 
estimates for new observations, he should master these later chapters 
and should apply the checks and limitations set forth there in stating 
his general conclusions or in making his estimates. 

Summary. This chapter has pointed out that the closeness of 
relation between two variables may be measured either by the abso- 
lute closeness with which values of one may be estimated from known 
values of the other or on the basis of the proportion of the variation 
in one which can be explained by, or estimated from, the accompany- 
ing values of the other. The absolute accuracy of estimate is measured 
by the standard error of estimate, which indicates the reliability of 
values of the dependent variable estimated from observed values of 
the independent variable.

The relative closeness of the relation is best measured by the coeffi- 
cient of determination, in the case of linear relationship, or by the 
index of determination, in the case of curvilinear relationship. These 
measures show the proportion of the variance in the dependent vari- 
able which is associated with differences in the other variable. In 
the case of variables causally related, they measure the proportion of 
the variance in one which can be said to be due to the other. 




The best methods of computing the various measures of corre- 
lation will be shown in the next chapter; the methods used in this 
chapter are designed rather to show the significance of the measures 
themselves.

This chapter has also called attention to the fact that the measures 
of correlation obtained from a sample will vary from the true facts 
of the universe, has referred to later chapters where standard errors 
for estimating such variation are discussed, and has warned against 
drawing general conclusions or making new estimates from a single 
sample unless the precautions described in these subsequent chapters 
are observed. 



CHAPTER 8 


PRACTICAL METHODS FOR WORKING TWO-VARIABLE 
CORRELATION PROBLEMS 

Terms to be used. The preceding discussion has developed the
means by which values of one variable may be estimated from the
values of another, according to the functional relation shown in a set 
of paired observations. Simple correlation involves only the means for 
making such estimates, and for measuring how closely those estimates 
conform to, and account for, the original variation in the variable 
which is being estimated, for the given set of observations. 

The regression line is used, in statistical terminology, to designate 
the straight line used to estimate one variable from another by means 
of the equation Y = a + bX 

This equation is termed the linear regression equation; and the coeffi- 
cient b, which shows how many units (or fractional parts) Y changes 
for each unit change in X, is termed the coefficient of regression. 

Where a curvilinear function has been determined, either by the 
use of an equation or by graphic methods, the corresponding curve is 
similarly designated as the regression curve. Either the mathematical 
equation or, if none has been computed, the expression 

$Y = f(X)$

where the symbol $f(X)$ stands for the relation shown by the graphic
curve, is termed the regression equation.

The coefficient of correlation and the index of correlation have 
both been defined as the ratio of the standard deviation of the esti- 
mated values of Y to the standard deviation of the actual values, 
whereas the standard error of estimate has been defined as the stand- 
ard deviation of the residuals from the estimates so made. In the 
case of linear relations, however, the coefficient of correlation and 
the standard error of estimate can both be computed directly from 
the same values as were employed in computing the constants of the 
regression equation. This will be illustrated by the practical example 
which follows. 




Working out a linear correlation. As was illustrated in Chapter
5, pages 64 to 71, the values for a and b of the regression equation can
be determined for any two variables, X and Y, between which it may be
desired to determine the relation, by working out the values $M_x$,
$M_y$, $\Sigma(X^2)$, and $\Sigma(XY)$, and then substituting them in
the appropriate equations. In order to compute directly the coefficient
of correlation, $r_{xy}$, and the standard error of estimate, $S_{yx}$,
it is necessary only to compute in addition the value $\Sigma(Y^2)$ and
substitute it in appropriate formulas. The data given in Table 31
illustrate the necessary operations.


TABLE 31

Computing the Values Needed to Determine Linear Regression and
Correlation Coefficients

Irrigation water     Yield of Pima
applied per acre*    cotton per acre*
      (X)                 (Y)               X²        XY        Y²
     Feet          Units of ten pounds

      1.8                 26               3.24      46.8       676
      1.9                 37               3.61      70.3     1,369
      2.5                 45               6.25     112.5     2,025
      1.4                 16               1.96      22.4       256
      1.3                  9               1.69      11.7        81
      2.1                 44               4.41      92.4     1,936
      2.3                 38               5.29      87.4     1,444
      1.5                 28               2.25      42.0       784
      1.5                 23               2.25      34.5       529
      1.2                 18               1.44      21.6       324
      1.3                 22               1.69      28.6       484
      1.8                 18               3.24      32.4       324
      3.5                 40              12.25     140.0     1,600
      3.5                 65              12.25     227.5     4,225

Total  27.6              429              61.82     970.1    16,057
Mean    1.97              30.64

* From James C. Muir and G. E. P. Smith, The use and duty of water in the Salt River Valley,
Agricultural Experiment Station Bulletin 120, University of Arizona, 1927. All the plots were on
the same type of soil, Maricopa sandy loam.


The computations shown in this table (squaring both X and Y,
calculating the product XY, summing X, Y, and the three columns
of extensions, and dividing the first two sums by the number of cases to
give the means of X and Y) provide all the basic data necessary.¹ The
values a and b for the regression equation may next be computed by
substituting these extensions in equations (9) and (10), which were
used previously in Chapter 5, page 66.


$b_{yx} = \frac{\Sigma(XY) - n M_x M_y}{\Sigma(X^2) - n M_x^2} = \frac{970.1 - 14(1.97)(30.64)}{61.82 - 14(1.97)^2} = \frac{125.050}{7.4874} = 16.701$

$a_{yx} = M_y - b_{yx} M_x = 30.64 - 16.701(1.97) = -2.261$

The regression line, $Y = a + bX$, therefore is for this case

$Y = -2.261 + 16.701X$
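The arithmetic of equations (9) and (10) can be checked with a short Python sketch. The variable names are illustrative; the sums are those at the foot of Table 31, and the means are rounded to two decimals because that is how the text carries them. Unrounded means would give a slightly different slope.

```python
# Totals from Table 31 (14 plots of Pima cotton).
n = 14
sum_x, sum_y = 27.6, 429.0
sum_xx, sum_xy = 61.82, 970.1

# Means rounded to two decimals, as in the worked example.
m_x = round(sum_x / n, 2)   # 1.97
m_y = round(sum_y / n, 2)   # 30.64

# Equation (9): slope of the regression of Y on X.
b = (sum_xy - n * m_x * m_y) / (sum_xx - n * m_x ** 2)
# Equation (10): constant term.
a = m_y - b * m_x

print(f"Y = {a:.3f} + {b:.3f} X")
```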

The unadjusted coefficient of correlation, $r_{xy}$, may now be computed
from the following new formula:

$r_{xy} = \frac{\Sigma(XY) - n M_x M_y}{\sqrt{[\Sigma(X^2) - n M_x^2][\Sigma(Y^2) - n M_y^2]}}$    (27)

$= \frac{970.1 - 14(1.97)(30.64)}{\sqrt{[61.82 - 14(1.97)^2][16{,}057 - 14(30.64)^2]}} = 0.847$

It should be noticed that the numerator of this fraction is the same as
that in the equation for b and that half of the denominator is the same,
except that it is under the radical sign.
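Equation (27) can be evaluated in the same way. The following is a minimal sketch using the Table 31 totals and the rounded means of the text; the variable names are illustrative only.

```python
import math

n = 14
m_x, m_y = 1.97, 30.64                 # rounded means from Table 31
sum_xx, sum_xy, sum_yy = 61.82, 970.1, 16057.0

# Numerator shared with the regression coefficient b.
numerator = sum_xy - n * m_x * m_y
# The two bracketed terms under the radical in equation (27).
x_term = sum_xx - n * m_x ** 2
y_term = sum_yy - n * m_y ** 2

r = numerator / math.sqrt(x_term * y_term)
print(round(r, 3))
```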

Comparison of equations (9) and (27) with equation (5) for the
standard deviation

$\sigma_x = \sqrt{\frac{\Sigma(X^2)}{n} - M_x^2}$

shows that they may be written more simply

$b_{yx} = \frac{\Sigma(XY) - n M_x M_y}{n \sigma_x^2} = \frac{\Sigma(xy)}{n \sigma_x^2}$    (27.1)

$r_{xy} = \frac{\Sigma(XY) - n M_x M_y}{n \sigma_x \sigma_y} = \frac{\Sigma(xy)}{n \sigma_x \sigma_y}$    (27.2)

The second form, in each case, uses the notation $\Sigma(xy)$ for
$\Sigma(XY) - n(M_x M_y)$, as discussed on page 66.² The forms shown in
equations (9), (10), and (27), however, are the ones ordinarily used in
actual computation, and should be kept clearly in mind.

¹ Where the number of cases to be handled is large, various short cuts
may be used to reduce the volume of computation required in computing
the sums of extensions $\Sigma X^2$, $\Sigma XY$, and $\Sigma Y^2$. The
use of these short cuts is developed in Appendix 1, pages 455 to 463.

Once $r_{xy}$ has been computed, the value adjusted for the number of
cases can then be obtained by equation (25).

$\bar{r}_{xy}^2 = 1 - (1 - r_{xy}^2)\left(\frac{n - 1}{n - 2}\right)$

For the present problem, that becomes

$\bar{r}_{xy}^2 = 1 - \frac{[1 - (0.847)^2](14 - 1)}{14 - 2} = 0.6939$

$\bar{r}_{xy} = 0.833$


Knowing $\bar{r}_{xy}$, we may next compute the standard error of
estimate by the following equation:

$\bar{S}_{yx} = \sqrt{\left[\frac{\Sigma(Y^2) - n M_y^2}{n - 1}\right](1 - \bar{r}_{xy}^2)}$    (28)

$= \sqrt{\left[\frac{16{,}057 - 14(30.64)^2}{13}\right][1 - (0.833)^2]} = \sqrt{68.62} = 8.28$

Since this equation includes $\bar{r}_{xy}$, already adjusted for the
number of observations, no further adjustment is necessary. The standard
error computed by equation (28) is identical with that obtained by
equation (21.1), or (21.2), given in the previous chapter.
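A sketch of equation (28) in Python follows, again with illustrative variable names and the rounded values used in the text. Because the intermediate figures are not rounded here, the quantity under the radical comes out a shade below the printed 68.62, but the standard error still rounds to 8.28.

```python
import math

n = 14
m_y = 30.64            # rounded mean yield, ten-pound units
sum_yy = 16057.0       # sum of Y squared from Table 31
r_adjusted = 0.833     # adjusted coefficient of correlation

# Variance of Y on n - 1 degrees of freedom, times the unexplained share.
variance_of_estimate = (sum_yy - n * m_y ** 2) / (n - 1) * (1 - r_adjusted ** 2)
standard_error = math.sqrt(variance_of_estimate)

print(round(standard_error, 2))
```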

As noted earlier, though $r_{xy} = r_{yx}$, $b_{xy}$ is not the same as
$b_{yx}$. The former regression, showing the change in X for each unit
change in Y (that is, regarding the dependent factor as the independent
factor instead), is obtained by modifying equation (9) to the following
form:³

$b_{xy} = \frac{\Sigma(XY) - n M_x M_y}{\Sigma(Y^2) - n M_y^2}$

² The value of $\Sigma(xy)$ is sometimes called the product moment.

³ When the correlation is perfect, so that $r = 1$, the two regression
coefficients will have the definite relation $b_{yx} = 1/b_{xy}$. Under
these conditions the regression lines will be identical, no matter which
variable is regarded as the independent variable and which as the
dependent.



The new regression coefficient, $b_{xy}$, shows the average change in
water applied with each additional unit (ten pounds) of cotton
harvested. With the quantity of water subject to human control, as
in this case, this relation appears to have little meaning. However,
if it is desired to chart it on Figure 22 along with the other regression
line, it can be charted according to the linear regression equation

$X = a_{xy} + b_{xy} Y$

The value of the new a can be computed by restating equation (10) in
the form

$a_{xy} = M_x - b_{xy} M_y$
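The relation between the two regression coefficients and r stated in equation (24.4) can be verified on the cotton data. This sketch uses the Table 31 totals with unrounded means, so its figures differ slightly in the third decimal from those carried in the text; the variable names are illustrative.

```python
import math

n = 14
sum_x, sum_y = 27.6, 429.0
sum_xx, sum_xy, sum_yy = 61.82, 970.1, 16057.0

m_x, m_y = sum_x / n, sum_y / n          # unrounded means
product_moment = sum_xy - n * m_x * m_y  # the quantity written Sigma(xy)

b_yx = product_moment / (sum_xx - n * m_x ** 2)   # yield per acre-foot
b_xy = product_moment / (sum_yy - n * m_y ** 2)   # acre-feet per unit yield
a_xy = m_x - b_xy * m_y                           # restated equation (10)

r = product_moment / math.sqrt(
    (sum_xx - n * m_x ** 2) * (sum_yy - n * m_y ** 2)
)

# Equation (24.4): r is the geometric mean of the two slopes.
print(round(math.sqrt(b_yx * b_xy), 4), round(r, 4))
```

The two printed values agree, which is the property that allows r to be recovered from the slopes of the two lines alone.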

Equation (28) completes the computation of all the values needed⁴
except the coefficient of determination, $d_{xy}$, which is simply
$\bar{r}_{xy}^2$. That is:

$d_{xy} = \bar{r}_{xy}^2 = (0.833)^2 = 0.694$

Interpreting the results of a linear correlation. The next step is 
to take the several constants which have been computed and see what 
they mean. 

The coefficient of regression of Y on X, $b_{yx} = 16.70$, shows that on
the average the acre yield of cotton increases 16.7 ten-pound units, or
167 pounds, for each additional acre-foot of water applied. The
constant a shows that with no water applied, a yield of -2.26 ten-pound
units, -22.6 pounds, or less than no cotton at all, might be expected.
Since these results are based on observations extending from 1.2 acre- 
feet of water to 3.5, the relations shown by the regression line do not 
necessarily hold beyond those limits, and it is not certain what the yield 
would be when no water is applied. Extrapolating the regression line 
to that point is only a guess. 

The regression equation 

$Y = -2.26 + 16.7(X)$

or

Yield = -22.6 + 167 (feet of water)

then gives the yields of cotton estimated as most likely to be obtained 
from the quantity of water applied within the limits of 1.2 to 3.5 feet. 
Figure 22 shows how these estimated values, along the regression line, 
compare with the actual yields observed. 

⁴ Except also the calculation of measures of reliability, as explained
in Chapters 18 and 19.

The standard error of estimate, 8.28 ten-pound units or 82.8 pounds, 
shows that the (adjusted) standard deviation of the differences be- 
tween the actual and the estimated values is 82.8 pounds of cotton. 
Two lines have been drawn in Figure 22, at 82.8 pounds above and 
below the regression line. It will be seen that of the 14 cases, 9 
fell between these two lines, or in the zone within one standard error on 
either side of the regression line. 
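The count of cases lying within one standard error of the line can be reproduced directly from the observations. This is a sketch only: the list names are illustrative, the observations are those of Table 31, and the constants are the rounded values quoted above.

```python
# Observations from Table 31: acre-feet of water and yield in ten-pound units.
water = [1.8, 1.9, 2.5, 1.4, 1.3, 2.1, 2.3, 1.5, 1.5, 1.2, 1.3, 1.8, 3.5, 3.5]
cotton = [26, 37, 45, 16, 9, 44, 38, 28, 23, 18, 22, 18, 40, 65]

a, b = -2.261, 16.701     # linear regression constants
standard_error = 8.28     # adjusted standard error of estimate

# Residual of each plot from the regression line.
residuals = [y - (a + b * x) for x, y in zip(water, cotton)]
inside = sum(1 for z in residuals if abs(z) <= standard_error)

print(inside, "of", len(residuals), "cases within one standard error")
```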



Fig. 22. Relation of yield of cotton to irrigation water applied;
estimated yields from a linear regression and zone of probable yields
indicated by the standard error of estimate.

The coefficient of correlation, $\bar{r}_{xy} = 0.83$, and the
coefficient of determination, $d_{xy} = 0.69$, show that about 69 per
cent of the variance
in the yield of this crop in this area, on the farms from which these 
records were obtained, could be accounted for by the differences in the 
quantity of water used in irrigation. Since this leaves only 31 per cent 
of the variance to be accounted for by all other factors, it would appear 
that the quantity of water applied (or other factors associated with it) 
was the most important factor which was associated with the yield of 
cotton on these farms and on this type of soil. 

The fact that 69 per cent of the variance in yield can be explained 
by corresponding differences in the quantity of water applied does
not in itself mean that the differences in irrigation caused the
differences in yield. For example, it might be possible that the quantity
of water applied was regulated to conform to the fertility of the land 
and that the differences in yield were really due to the differences in 
fertility. The statistical measure merely tells how closely the vari- 
ance in one variable was associated with variance in the other; 
whether that association is due to, or can be taken as evidence of, 
cause-and-effect relation is another matter, and is outside the scope of 
the statistical analysis. (For more extended discussion of this point, 
see the last two chapters of this book.) 

Working out a curvilinear correlation. The next step is to 
consider whether the straight line is adequate to describe the way that 
the yield increases as more water is applied, or whether a curve had 
better be employed. (This step can be taken before any of the linear 
results are worked out, and, if a curve is decided on, the previous 
work can be skipped entirely, if desired.) 

Before fitting the curve, we must consider what type of curve 
it is logical to expect. In most agricultural production problems, 
diminishing returns are experienced.⁵ That is, the application of
successive increments of fertilizer or other productive aid on the same
areas will be expected to produce a smaller and smaller increase in 
the product. Also, it is known that if too much of some factors are 
applied, the result may be to produce a decline in output. The decline 
after the point of optimum application is reached may be gradual, or 
it may be sudden, owing to a toxic effect of too much of one substance 
upon the plant or animal. These considerations would lead us to ex- 
pect a curve with the following characteristics: 

1. It should rise steeply at first, and then less and less sharply 
until a maximum is reached. 

2. It might show a decline after the maximum is reached, either
gradual or sharp. 

3. It would have only the single point of inflection (change of 
direction) at the optimum application. 

These are the conditions we shall apply in fitting the curve. 

⁵ William J. Spillman, The Law of Diminishing Returns, World Book Co.,
Yonkers-on-the-Hudson, New York, and Chicago, 1924.

Examining Figure 22 more closely, we see that, in the range up
to 1.8 acre-feet of water, the actual yields lie below the regression
line four times, and above four times; in the range from 1.9 to 3
acre-feet, the actual yields lie above in all four observations; and above 3
acre-feet the one yield below the line is much farther below than is the 
one above. These facts suggest that a curve convex from above, giving 
lower estimated yields than the straight line for the lowest and highest 
applications of water and higher estimated yields for the intermediate 
applications, would more accurately represent the relations in this 
case. (The number of observations is far too low to serve as a very 
accurate indication of the shape of the curve, but it will serve at 
least as a simple illustration of the way the whole problem may be 
worked through.) 

The next step is to group the observations according to the value 
of X (the quantity of water) and average both X and Y, water and 
yield. In view of this small number of observations, rather large 
groups are taken; were more cases available, the groups might be 
made narrower. 

TABLE 32

Computation of Group Averages to Indicate Regression Curve, Cotton Example

          X (water)        X (water)        X (water)        X (water)
          1 to 1.4         1.5 to 1.9       2.0 to 2.9       3.0 to 3.9

           X      Y         X      Y         X      Y         X      Y

          1.4    16        1.8    26        2.5    45        3.5    40
          1.3     9        1.9    37        2.1    44        3.5    65
          1.2    18        1.5    28        2.3    38
          1.3    22        1.5    23
                           1.8    18

Sums      5.2    65        8.5   132        6.9   127        7.0   105
Means     1.3    16.25     1.7    26.4      2.3    42.33     3.5    52.5

These averages are then plotted, as shown in Figure 23, an irregu- 
lar line dotted in connecting them and as smooth a curve as possible 
which fulfills the stated conditions drawn in freehand through the 
averages and the broken line, just as discussed in pages 105 to 110, 
Chapter 6. This then gives the regression curve. It is seen to fit 
the data well, and yet to fulfill the logical conditions stated. The 
point of maximum yield, however, apparently lies beyond the limit 
of the observations. 




Next the estimated yields for each different application of water 
are read off from this curve, and the difference between the actual 
and the estimated yields is determined. These residuals are then
squared to determine their standard deviation. In case the linear 
correlation has not been previously worked, the yields, or Y values, are 
also squared as shown, so as to determine their standard deviation, 
and so give the basis for measuring the amount of correlation. 

Fig. 23. Relation of yield of cotton to irrigation water applied;
estimated yields from a curvilinear regression; and zone of probable
yields as indicated by the standard error of estimate.

The sum of the Y″ values is slightly smaller than the sum of
the Y values, and the mean of the z″ values is therefore not exactly
zero, but 0.264. That indicates that the curve shown in Figure 23
should be shifted up 0.264 unit, or 2.64 pounds, to make the
estimated and actual averages agree.⁶ Representing this curve by
$f(X)$, the regression equation for the curvilinear correlation may
therefore be written:

$Y = k + f(X)$

$Y = 2.64 + f(X)$

⁶ In problems with many observations, the sum of the Y values and of the
Y″ values may be determined separately for the several different
portions of the curve, to see if its position should be shifted in one
portion and not in another. This process cannot be carried too far,
however, for if the divisions are made too small the effect will be to
make the curve pass through each successive group average, without
smoothing out the irregularities into a continuous function.

TABLE 33

Computation of Residuals and Standard Deviation for Curvilinear
Regression, Cotton Example

Water        Yield, in     Yield estimated
per acre,    ten-pound     from X, in ten-      Y - Y″
    X        units, Y      pound units, Y″       (z″)       (z″)²        Y²

   1.8          26              29.0             - 3.0        9.00        676
   1.9          37              31.0               6.0       36.00      1,369
   2.5          45              42.8               2.2        4.84      2,025
   1.4          16              19.2             - 3.2       10.24        256
   1.3           9              16.8             - 7.8       60.84         81
   2.1          44              35.2               8.8       77.44      1,936
   2.3          38              39.5             - 1.5        2.25      1,444
   1.5          28              21.9               6.1       37.21        784
   1.5          23              21.9               1.1        1.21        529
   1.2          18              14.2               3.8       14.44        324
   1.3          22              16.8               5.2       27.04        484
   1.8          18              29.0             -11.0      121.00        324
   3.5          40              54.0             -14.0      196.00      1,600
   3.5          65              54.0              11.0      121.00      4,225

Sums           429             425.3             + 3.7      718.51     16,057


The values at the foot of Table 33 now give the constants necessary
to measure the closeness of the correlation. First the standard
deviations of Y and of z″ are computed, using the formula

$\sigma_y = \sqrt{\frac{\Sigma(Y^2) - n(M_y^2)}{n}} = 14.44$

$\sigma_{z''} = \sqrt{\frac{\Sigma(z'')^2 - n(M_{z''}^2)}{n}} = \sqrt{\frac{718.51 - 14(0.264^2)}{14}} = 7.16$

Then, by equation (22.2),

$\bar{S}_{y.f(x)} = 8.07$



Here 3 is used for the value of m, since it is judged that a parabolic 
equation of type (a) , with 3 constants, would be adequate to reproduce 
the freehand curve. 
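The standard deviation of the residuals and the adjusted standard error of estimate can be reproduced from the z″ column of Table 33. This sketch assumes that equation (22.2), which is not reproduced in this chapter, takes the form $\bar{S}^2 = n\sigma_{z''}^2/(n - m)$; that form is consistent with the 8.07 reported above. The variable names are illustrative.

```python
import math

# Residuals z'' = Y - Y'' from Table 33, in ten-pound units.
residuals = [-3.0, 6.0, 2.2, -3.2, -7.8, 8.8, -1.5,
             6.1, 1.1, 3.8, 5.2, -11.0, -14.0, 11.0]

n = len(residuals)
m = 3   # constants judged necessary to reproduce the freehand curve

mean_z = sum(residuals) / n
sum_zz = sum(z * z for z in residuals)

# Standard deviation of the residuals about their own mean.
sigma_z = math.sqrt(sum_zz / n - mean_z ** 2)

# Assumed form of equation (22.2): adjust for the n - m degrees of freedom.
standard_error = math.sqrt(n * sigma_z ** 2 / (n - m))

print(round(mean_z, 3), round(sigma_z, 2), round(standard_error, 2))
```

Carried to more decimals the standard error is about 8.077, which the text reports as 8.07.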

The standard error of estimate for the graphic regression curve is 
thus 8.07 ten-pound units, or 80.7 pounds. This is 2.1 pounds smaller 
than the corresponding value in the case of the linear correlation, in- 
dicating how much more closely the curve fits the data than does the 
straight line, even after allowing for its greater flexibility. In Figure 
23 two dotted lines have been drawn in, each 80.7 pounds away from 
the regression curve, indicating the zone of estimate within which 
approximately two-thirds of the cases fall (10 out of 14 in this instance) 
and within which two-thirds of the actual yields may be expected to fall 
if new estimates of yield are made from the water applied for addi- 
tional cases drawn from the same universe. (Note also the discussion, 
in Chapters 18 and 19, of the reliability of such estimates.) 

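The index of correlation, $\bar{\rho}_{yx}$, may next be computed by
substituting the two standard deviations in formula (29):

$\bar{\rho}_{yx}^2 = 1 - \left(\frac{\sigma_{z''}^2}{\sigma_y^2}\right)\left(\frac{n - 1}{n - m}\right)$    (29)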


This formula includes the corrections for the number of variables 
and constants. It should always be used in calculating the index of 
correlation where the curve has been determined freehand, as in this 
case, since it gives a more accurate measure of the correlation than 
does equation (23.2), shown previously. 

Where the equation of the curve has been determined by mathe- 
matical means, the standard error of estimate and the index of corre- 
lation may be computed without working out the estimates and 
residuals for each of the individual cases. These methods will be 
described subsequently.⁷

In the example given, the index of correlation works out

$\bar{\rho}_{yx}^2 = 1 - \left[\frac{(7.16)^2}{(14.44)^2}\right]\left[\frac{14 - 1}{14 - 3}\right] = 1 - 0.2905 = 0.7095$

$\bar{\rho}_{yx} = 0.842$
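Formula (29) can be checked with the two standard deviations as printed. The sketch below uses illustrative variable names and also forms the difference between the index and the coefficient of determination, which is the only comparison the text treats as meaningful.

```python
import math

sigma_z = 7.16     # standard deviation of residuals from the curve
sigma_y = 14.44    # standard deviation of the yields, as printed
n, m = 14, 3       # cases, and constants attributed to the curve

# Formula (29): adjusted index of determination, then index of correlation.
index_of_determination = 1 - (sigma_z ** 2 / sigma_y ** 2) * (n - 1) / (n - m)
index_of_correlation = math.sqrt(index_of_determination)

# Gain over the linear coefficient of determination, 0.694.
gain = index_of_determination - 0.694

print(round(index_of_determination, 4), round(index_of_correlation, 3), round(gain, 3))
```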


Since the index of determination is simply $\bar{\rho}_{yx}^2$, it is
71.0 per cent. Comparing these results with those obtained by linear
correlation, the index of determination of 71.0 per cent compares with
the coefficient of determination of 69.4 per cent. Apparently taking
into account the curvilinear nature of the relations has increased the
proportion of the variance in yield accounted for by differences in
water application by 1.6 per cent of the total variance in the yield.⁸
(Only the measures of determination can be directly compared in this
way. If the coefficient of correlation, 0.833, were subtracted from the
index of correlation, 0.842, that would give an incorrect idea of the
importance of taking account of the curvilinear nature of the relation.)

⁷ See page 412, Chapter 22.

Interpreting the results of curvilinear correlation. The index 
of determination and the accompanying standard error of estimate 
have been interpreted for the curve in much the same manner as were 
the coefficient of determination and the standard error of estimate for 
the straight line. In the case of the regression curve itself, however, 
a somewhat different method of presentation may be best, since a 
mathematical equation expressing the relation has not been computed. 

TABLE 34

Yield of Pima Cotton, with Different Applications of Irrigation Water, on
Maricopa Sandy Loam Soils in the Salt River Valley, Arizona, in
1913, 1914, and 1915

Irrigation water      Average yield of
    applied             cotton lint

   Acre-feet          Pounds per acre

     1.25                   150
     1.50                   222
     1.75                   283
     2.00                   336
     2.25                   385
     2.50                   431

The regression curve just worked out for the cotton problem, for 
example, may be presented either as a curve showing graphically the 
yield to be expected for various applications of water, as is illus- 
trated in Figure 23, or as a table showing the same thing, as in Table 
34. In both instances the constant which has been determined from
the average of z″ is added to the values read from the curve in Figure
23, f(X), so as to give the final estimates which would be made by
taking into account this slight shift in the position of the curve.

^ See Chapter 18, page 319, for tests as to whether this difference is large enough
to be significant.






Similar presentation could be given the regression line in cases of 
linear correlation, if desired, but then the chart would show only a 
straight line and the table would show exactly the same changes in 
the dependent variable for each successive uniform change in the 
independent variable. In preparing the table, the relation is shown 
only for that range of water application within which the bulk of the 
observations fall. Similarly, only this range should be shown by the 
solid line in the chart; a dotted line might be used to indicate the 
relations beyond that up to the extremes observed. Neither the
regression line nor curve should, ordinarily, be carried beyond the
limits of the observations on which it was based. Also, before general
conclusions are drawn as to the application of the results to cases
other than those included in the sample (as, in this instance, to other
fields in the same area), the standard errors set forth in Chapters 18
and 19 should be calculated and included in the interpretation.

Summary. This chapter has illustrated the way in which corre- 
lation analysis may be applied to a specific problem, the manner in 
which linear and curvilinear regressions may be determined most 
simply, and the way in which they may be interpreted. In addition, 
the simplest manner of computing the standard error of estimate and 
the coefficient and the index of correlation have been illustrated, and 
their significance has been briefly discussed. 



CHAPTER 9 


THREE MEASURES OF CORRELATION— THE MEANING 
AND USE FOR EACH 

So many different statistical coefficients have been introduced in 
the discussion of correlation that there may be some confusion among 
them as to the meaning and use of the different coefficients. Par- 
ticularly in linear correlation, there are three constants which sum- 
marize nearly all that a correlation analysis reveals. 

First, the standard error of estimate shows how nearly the esti- 
mated values agree with the values actually observed for the variable 
being estimated. This coefficient is stated in the same units as the 
original dependent variable, and its size can be compared directly with 
those values. 

Second, the coefficient of determination (r²) shows what proportion
of the variance in the values of the dependent variable can be
explained by, or estimated from, the concomitant variation in the
values of the independent variable.¹ Since this coefficient is a ratio, it
is a “pure number”; that is, it is an arbitrary mathematical measure,
whose values fall within a certain limited range, and it can be
compared only with other constants like itself, derived from similar
problems.

Finally, the coefficient of regression measures the slope of the
regression line; that is, it shows the average number of units increase
or decrease in the dependent variable which occur with each increase
of a specified unit in the independent variable. Its exact size thus
depends not only on the relation between the variables but also on the
units in which each is stated. It can be reduced to another form,
however, by stating each of the variables in units of their own
individual standard deviation. In this form it has been termed β or
the “beta” coefficient.² The relation between beta and the coefficient
of regression may be indicated by stating the regression equation in
both ways:

$$Y = a + b_{yx} X$$

$$\frac{Y}{\sigma_y} = a' + \beta_{yx} \frac{X}{\sigma_x}$$

Stated in this way, β_yx for the cotton-yield problem is 0.845. That
is, for each increase of one standard deviation (0.73 acre-foot of
water) in X, the yield of cotton increased 0.845 of one standard
deviation. Since the standard deviation of Y was 144.3 pounds, that
is equal to 121.9 pounds of cotton for each 0.73 acre-foot of water.
This is at the rate of 167 pounds of cotton for each foot of water,
which is the same thing as was shown by the coefficient of regression.
However, for comparisons between problems where the standard
deviations are much different, the “beta” coefficient may have value.
It is evident that in simple correlation the value of beta is the same
as that of r.

¹ These statements are all subject to the error limitations set forth later, in
Chapters 18 and 19.

² See Truman Kelley, Statistical Method, p. 282, The Macmillan Co., New York,
1924.
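The conversion from the “beta” coefficient back to the ordinary coefficient of regression can be traced step by step. The sketch below uses only the three figures quoted in the text; the variable names are illustrative.

```python
beta = 0.845   # "beta" coefficient for the cotton-yield problem (from the text)
sd_y = 144.3   # standard deviation of yield, pounds (from the text)
sd_x = 0.73    # standard deviation of water applied, acre-feet (from the text)

# Pounds of cotton gained for one standard deviation (0.73 acre-foot) of water
gain_per_sd = beta * sd_y
# Pounds of cotton gained per acre-foot: the ordinary regression coefficient
b_yx = gain_per_sd / sd_x

print(round(gain_per_sd, 1), round(b_yx))
```

Multiplying by the standard deviation of Y and dividing by that of X is all that separates the two forms, which is why the two coefficients carry the same information within a single problem.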

Relation of the different coefficients to each other. Even though 
each of the three coefficients measures certain aspects of the relation 
between variables, it does not follow that all three coefficients will 
vary together, or that a problem which shows a high coefficient of 
determination will also show a high regression coefficient or a low 
standard error of estimate. That is because they measure different 
aspects of the relation. 

The particular usefulness of each of the three different groups 
of correlation measures is illustrated in Figure 24, which shows three 
sets of simple relationships, with hypothetical data. 

Here the regression coefficient is smaller in A than B. In A an
additional inch of rain causes an average increase of 2.5 bushels in
yield, as compared with an increase of 3.1 bushels in B. But in case
A, a considerable part of the variation in yield is apparently due to
rainfall, as shown by the high correlation (r = 0.83) and the small
size of the standard error of estimate (2.2 bushels); whereas in case
B, factors other than rainfall apparently cause most of the differences
in yield, as indicated by the lower correlation (r = 0.71) and
the larger standard error of estimate (3.8 bushels). In terms of
determination apparently about 69 per cent of the differences in yield are
related to differences in rainfall in the first case, and only about 50
per cent in the second.

In comparison with A and B, case C has much less variable yields,
ranging from only about 8 bushels to 12 bushels, compared with a
range of 8 to 21 in case A and 0 to 20 in case B. Only a small part
(22 per cent) of the variation in yields is associated with rainfall
differences, as indicated by the low correlation (0.47). An increase
of 1 inch in rainfall apparently causes only 0.5 bushel increase in
yield. Yet in spite of this low relation, it is possible to estimate yields
more accurately, given the rainfall, in this case than in either of the
other two, as is shown by the standard error of estimate of 1.15
bushels as compared to 2.2 bushels for A and 3.8 for B. The original
variation in yields is so slight in case C that even the small relation
shown to rainfall is enough to make it possible to estimate yields more
accurately than in either of the other cases.³

Fig. 24. Hypothetical sets of data, illustrating three types of correlation
coefficients. (Three dot charts, each plotted against rainfall.)
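The three cases can be set side by side in a short sketch. The figures are those quoted in the text for cases A, B, and C (with 3.8 bushels taken as the standard error for B, as first stated); the dictionary layout and names are illustrative only.

```python
# r: coefficient of correlation; b: bushels per inch of rain;
# s_est: standard error of estimate in bushels (figures quoted in the text)
cases = {
    "A": {"r": 0.83, "b": 2.5, "s_est": 2.2},
    "B": {"r": 0.71, "b": 3.1, "s_est": 3.8},
    "C": {"r": 0.47, "b": 0.5, "s_est": 1.15},
}

# Coefficient of determination, in per cent, is the square of r
determination = {k: round(100 * v["r"] ** 2) for k, v in cases.items()}

# Each measure singles out a different case
steepest = max(cases, key=lambda k: cases[k]["b"])
closest = max(cases, key=lambda k: cases[k]["r"])
most_accurate = min(cases, key=lambda k: cases[k]["s_est"])

print(determination, steepest, closest, most_accurate)
```

The three rankings disagree, which is the point of the illustration: slope, proportion of variance explained, and accuracy of estimate are separate questions.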

These three cases illustrate the relative place of each of the three
types of correlation measure. Case B shows the greatest change in
yield for a given change in rainfall (the regression measure); case
A shows the highest proportion of differences in yields accounted for
by rainfall (the correlation or determination measure); and case C
shows the greatest accuracy of estimate (the error of estimate
measure). Which of these measures should have most attention in a
particular investigation depends upon the phase of the investigation which
is most important: amount of change (regression); the proportionate
importance (correlation); or the accuracy of estimate (standard
error). All have their place, and none should be entirely overlooked
or ignored.

³ In calculating the measures for these illustrative cases, the corrections for
numbers of cases have been ignored, as they would not have affected the particular
points these examples were set up to illustrate.



CHAPTER 10 


DETERMINING THE WAY ONE VARIABLE CHANGES WHEN 
TWO OR MORE OTHER VARIABLES CHANGE: (1) BY 
SUCCESSIVE ELIMINATION 

The Problem of Multiple Relations 

The relations studied up to this point have all been of the type 
where the differences in one variable were considered as due to, or 
associated with, the differences in one other variable. But in many 
types of problems the differences in one variable may be due to a 
number of other variables, all acting at the same time. Thus the 
differences in the yield of corn from year to year are the combined 
result of differences in rainfall, temperature, winds, and sunshine, 
month by month or even week by week through the growing season. 
The premiums or discounts at which different lots of wheat sell on 
the same day vary with the protein content, the weight per bushel, 
the amount of dockage or foreign matter, and the moisture content. 
The speed with which a motorist will react to a dangerous situation 
may vary with his keenness of sight, his speed of nervous reaction, his 
intelligence, and his familiarity with such situations. The price at 
which sugar sells at wholesale may depend upon the production of 
that season, the carryover from the previous season, the general level 
of prices, and the prosperity of consumers. The weight of a child will 
vary with its age, height, and sex. The volume of a given weight of 
gas varies with the temperature and the barometric pressure. 

The physicist and the biologist use laboratory methods to deal
with problems of compound or multiple relationship. Under laboratory
conditions all the variables except the one whose effect is being studied
may be held constant, and the effect determined of differences in
the one remaining varying factor upon the dependent variable, while
effects of differences in the other variables are thus eliminated. In
the case of a gas, for example, the temperature may be held constant
while the volume at different barometric pressures is determined
experimentally, and then the pressure held constant while the volume
at different temperatures is determined. For many of the problems
with which the statistician has to deal, however, such laboratory
controls cannot be used. Rainfall and temperature and sunshine vary
constantly, and only their combined effect upon crop yields can be
noted. Economic conditions are constantly shifting, and only the
total result of all the factors in the existing situation can be measured
at any time. And so on through many other types of multiple
relations similar to those mentioned; the statistician has to deal with facts
arising from the complex world about him, and frequently has but
little opportunity to utilize laboratory checks or artificial controls.

Theoretical example. Where a dependent variable is influenced
not only by a single independent variable, as in the relation of Y to X,
but also by two or more independent variables, we can represent the
relation symbolically by the equation

$$X_1 = a + b_2 X_2 + b_3 X_3 + \ldots + b_n X_n \qquad (29.1)$$

Here X1 represents the dependent variable, and X2, X3, . . . Xn
represent the several independent variables.

The meaning of the several constants in this equation and the way 
in which it may be interpreted geometrically can be shown by making 
up a simple example. 

Let us assume that in a new irrigation project the farms are all 
alike in quality of land and kinds of buildings and that the price at 
which each one is sold to the settlers is computed as follows: 

Buildings, $1,000 per farm 

Irrigated land, $100 per acre 

Range (non-irrigated) land, $20 per acre. 

Using X1 to represent the selling price per farm in dollars, X2 to
represent the number of acres of irrigated land in each farm, and X3
to represent the number of acres of range land, we can state the method
of computing the selling price in the single equation

$$X_1 = 1,000 + 100 X_2 + 20 X_3$$

The relations stated in this equation may be represented graphically
as shown in Figure 24.1. The representation is broken up into halves.
The first half shows the relation of farm value to irrigated land for
farms that have no range land; the second shows the relation of farm
value to range land for farms that have no irrigated land. This figure
is constructed exactly the same as was Figure 9 on page 61. Thus in
the upper section of Figure 24.1, each change of 1 unit in X2, as, for
example, from 3 to 4, adds 1 unit of b2, or $100, to the farm value.
Similarly, in the lower section of Figure 24.1, each change of 1 unit in
X3, as, for example, from 5 to 6, adds 1 unit of b3, or $20, to the farm
value. In each case, as for zero acres, the line begins with the value
of a, $1,000, to cover the value of the buildings.

Fig. 24.1. Graph of the function 1,000 + 100X2 + 20X3. (Upper section: farm
value against X2, acres of irrigated land; lower section: farm value against
X3, acres of range land.)

The equation just shown (29.1) is called the multiple regression
equation. The term multiple is added to indicate that it explains X1
in terms of two or more independent variables, X2, X3, . . . Xn. The
coefficients b2 and b3 are termed net regression coefficients. The term
net is added to indicate that they show the relation of X1 to X2 and
X3, respectively, excluding, or net of, the associated influences of the
other independent variable or variables. In contradistinction, the
regression coefficient b_yx of equation (8),

$$Y = a + b_{yx} X$$

may be termed the gross regression coefficient. The term gross is added
here to indicate that it shows the apparent, or gross, relation between
Y and X without considering whether that relation is due to X alone,
or to other independent variables associated with X.






The difference between the net and gross regression coefficients may
be further shown by a simple arithmetic illustration, based on the
farm-value formula just discussed.

Let us take a dozen assumed irrigated farms and calculate from 
the pricing equation what their selling prices should be. In setting up 
these illustrative farms, let us assume further that in general the farms 
with large irrigated areas had small range areas and those with little 
irrigated land had larger amounts of range land. Under these condi- 
tions the computation works out as follows: 

TABLE 34.1

Computation of Estimated Selling Price, with X1 = 1,000 + 100X2 + 20X3

Observation    X2     X3     100(X2)    20(X3)    Calculated values of X1,
number         (1)    (2)    (3)        (4)       (3) + (4) + 1,000

 1             8       5     800        100       1,900
 2             4       5     400        100       1,500
 3             3      10     300        200       1,500
 4             7       8     700        160       1,860
 5             7      10     700        200       1,900
 6             8      15     800        300       2,100
 7             6      12     600        240       1,840
 8             1      15     100        300       1,400
 9             4      17     400        340       1,740
10             2      22     200        440       1,640
11             4      20     400        400       1,800
12             5      13     500        260       1,760
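The computation in Table 34.1 can be reproduced directly from the pricing equation. The sketch below is illustrative; the function name `selling_price` and the list layout are not from the text.

```python
# (X2 acres of irrigated land, X3 acres of range land) for the twelve assumed farms
farms = [(8, 5), (4, 5), (3, 10), (7, 8), (7, 10), (8, 15),
         (6, 12), (1, 15), (4, 17), (2, 22), (4, 20), (5, 13)]

def selling_price(x2, x3):
    """X1 = 1,000 (buildings) + 100 per irrigated acre + 20 per range acre."""
    return 1000 + 100 * x2 + 20 * x3

prices = [selling_price(x2, x3) for x2, x3 in farms]
for number, ((x2, x3), x1) in enumerate(zip(farms, prices), start=1):
    print(number, x2, x3, 100 * x2, 20 * x3, x1)
```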


The apparent relation of the values of X1, as just computed, to X2
and X3 may be shown by preparing dot charts of the X1 to X2 relation
and the X1 to X3 relation. These dot charts are shown in Figure 24.2.

Examining this figure, we find that X1 is fairly closely related to
X2 but that it has no definite relationship to X3. We could calculate
the regression lines for each of the two relationships shown. The
regression coefficient, b12, for the first comparison, would show the
average change in X1 with unit changes in X2. The regression
coefficient, b13, for the second comparison, would show the average change
in X1 with unit changes in X3. The latter coefficient would come very
close to zero, to judge visually from the chart. Both these would be
gross regression coefficients, measuring only the apparent relation
between X1 and each of the other variables. We know in this case that
the values of X1 are completely determined by the values of X2 and X3.
If we could hold constant, or eliminate, the true effect of X2 on X1,
we should find that the relation of the corrected values of X1 to X3 was
just as close as to X2. In spite of the fact that the gross regression,
b13, appears to be zero, the net regression, b3, is really 20.
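The two gross regression coefficients can be worked out from the twelve assumed farms rather than judged by eye. The sketch below fits each straight line by ordinary least squares; the helper `slope` is illustrative and is not a routine from the text.

```python
# X2 (irrigated acres), X3 (range acres), and X1 (selling price) from Table 34.1
x2 = [8, 4, 3, 7, 7, 8, 6, 1, 4, 2, 4, 5]
x3 = [5, 5, 10, 8, 10, 15, 12, 15, 17, 22, 20, 13]
x1 = [1000 + 100 * a + 20 * b for a, b in zip(x2, x3)]

def slope(x, y):
    """Least-squares slope of y on x (a gross regression coefficient)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

b12 = slope(x2, x1)   # gross regression of X1 on X2
b13 = slope(x3, x1)   # gross regression of X1 on X3
print(round(b12, 2), round(b13, 2))
```

Under these assumptions the gross coefficient on X2 falls well short of the net value of 100, and the gross coefficient on X3 is slightly negative although the net value is 20, because farms with much range land tend to have little irrigated land.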

By using the known net regression of X1 on X2, we can correct the
X1 values to eliminate that part of their variation which is due to X2
and then relate the remaining fluctuation to X3. Let us do that by
subtracting b2X2 from X1. This process is shown in Table 34.2.

Fig. 24.2. The apparent relation of farm value to acres of irrigated land and to
range land reveals little of the underlying net relationship. (Two dot charts:
farm value against X2, acres of irrigated land, and against X3, acres of range
land.)

We can now plot the values of X1, corrected for X2, X1 - b2X2,
as shown in the sixth column, against the X3 value, as shown in the
third column. The resulting dot chart is shown in Figure 24.3.

This figure now shows the underlying relation between X1 and X3,
with all the dots falling exactly on one straight line. If we now draw
in the regression line and calculate its slope, we shall find it is exactly
the same as the line for b3 which was illustrated in the lower section of
Figure 24.1. Figure 24.3 illustrates the net regression of X1 on X3,
as contrasted to the gross regression which was represented by the
lower section of Figure 24.2. If X1 were similarly corrected for X3
and the values X1 - b3X3 were plotted against X2, the net regression
of X1 on X2 would similarly be shown. (This step is left for the
student to perform.)
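The correction can also be carried through numerically. The sketch below removes the known contribution of one variable and fits a straight line to what remains; the helper `fit_line` is illustrative, and the second half works the step the text leaves to the student.

```python
x2 = [8, 4, 3, 7, 7, 8, 6, 1, 4, 2, 4, 5]
x3 = [5, 5, 10, 8, 10, 15, 12, 15, 17, 22, 20, 13]
x1 = [1000 + 100 * a + 20 * b for a, b in zip(x2, x3)]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# X1 corrected for X2 (column 6 of Table 34.2), related to X3
corrected_for_x2 = [v - 100 * a for v, a in zip(x1, x2)]
a3, b3 = fit_line(x3, corrected_for_x2)

# X1 corrected for X3, related to X2
corrected_for_x3 = [v - 20 * b for v, b in zip(x1, x3)]
a2, b2 = fit_line(x2, corrected_for_x3)

print(round(a3), round(b3), round(a2), round(b2))
```

Because the prices were built from the formula without any residual, each corrected series falls exactly on its line and the fitted slopes recover the net coefficients.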




Fig. 24.3. After the net influence of irrigated land has been removed, the
underlying relation of farm value to acres of range land is very clear. (Dot
chart titled “Relation of value corrected for irrigated land to range land,”
plotted against X3, acres of range land.)


TABLE 34.2

Correction of Computed X1 for Contribution of X2

Observation    X2     X3     X1       b2X2       X1 - b2X2
number                                (100X2)
(1)            (2)    (3)    (4)      (5)        (6)

 1             8       5     1,900    800        1,100
 2             4       5     1,500    400        1,100
 3             3      10     1,500    300        1,200
 4             7       8     1,860    700        1,160
 5             7      10     1,900    700        1,200
 6             8      15     2,100    800        1,300
 7             6      12     1,840    600        1,240
 8             1      15     1,400    100        1,300
 9             4      17     1,740    400        1,340
10             2      22     1,640    200        1,440
11             4      20     1,800    400        1,400
12             5      13     1,760    500        1,260

If we had not known the underlying relationships as given in this
case to start with, but merely had the series of observations of X1, X2,
and X3 shown in Table 34.1 and Figure 24.2, would it be possible to
work out from those observations the underlying, or net, relationships?
That is the problem which next will be explored. This time we shall
use a series where we do not know the relationship, and see how we
can proceed to work it out. Also, as in most practical cases, we shall
use an example where all the causes of variation are not known and
where we must deal with independent variables which explain only
a part of the variation in the dependent variable.

Practical example. The problem of multiple relations is illustrated 
by the data in Table 35. These represent 20 farms in one area, with 
varying crop acreages, dairy cows, and incomes. To determine from 
these records what income may be expected, on the average, with a 
given size of farm and with a given number of cows, it is necessary to 
estimate the effect of differences in the number of acres on income 
and also the effect of differences in the number of cows on income. 

TABLE 35

Acres, Number of Cows, and Incomes, for 20 Farms

Record no.    Size of farm         Size of dairy       Income
              (number of acres)    (number of cows)    (dollars per year)

 1             60                  18                    960
 2            220                   0                    830
 3            180                  14                  1,260
 4             80                   6                    610
 5            120                   1                    590
 6            100                   9                    900
 7            170                   6                    820
 8            110                  12                    880
 9            160                   7                    860
10            230                   2                    760
11             70                  17                  1,020
12            120                  15                  1,080
13            240                   7                    960
14            160                   0                    700
15             90                  12                    800
16            110                  16                  1,130
17            220                   2                    760
18            110                   6                    740
19            160                  12                    980
20             80                  15                    800


From these data it would seem that both the size of the farm and
the size of the dairy herd influence farm income, to judge from dot
charts showing the relation of income to acres (Figure 25) and of
income to number of cows (Figure 26). It appears from these charts
that there may be a slight tendency for the farms with the larger
acreage in crops to have larger incomes and a rather marked tendency
for the farms with the larger number of cows to have larger incomes.

Fig. 25. Correlation chart of acres and income on individual farms.

Fig. 26. Correlation chart of number of cows and income on individual farms.

Analysis by simple averages not adequate. The simple comparison
alone, however, is not sufficient to tell exactly how incomes change
with acres and with number of cows. That is because there is a
marked relation between the size of the farms and the number of cows,
as is illustrated in Figure 27. There is a definite tendency for the larger
farms to have smaller dairy herds. As a result, the difference in
incomes in Figure 25, which appeared to be due directly to differences
in acreages, may be due in part to the differences in the sizes of the
dairy herds on the farms with different acreages in crops. If we make
groups of farms of 50 to 99 acres, 100 to 149 acres, and so on, and
average the acres, cows, and income for each group, as is shown in
Table 36, we find a marked difference in the number of cows from
group to group, as well as in the number of acres and in the incomes.

Fig. 27. Correlation chart of number of cows and number of acres on
individual farms.


TABLE 36

Average Number of Cows and Income, for Farms of Different Sizes

Size group       Number      Average size         Average size of dairy    Average income
                 of farms    (number of acres)    (number of cows)         (number of dollars)

50-99 acres      5            76                  13.6                     838
100-149 acres    6           111                   9.8                     887
150-199 acres    5           166                   7.8                     924
200-249 acres    4           228                   2.8                     828


The farms of 50 to 99 acres, with an average size of 76 acres, have 
incomes which average $838; the farms of 150 to 199 acres, with an 
average size of 166 acres, show incomes which average $924. Is this 
difference in income due to the difference in size? Before this can 
be definitely answered we must consider that the two groups also 
differ in the average number of cows, with 13.6 in the first group and 
only 7.8 in the second. So far, there is nothing to indicate whether 
the difference in income is due to the difference in the size of the 
farms or in the number of cows; we have shown that both vary from 
group to group, and that is all. 
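The grouped averages can be recomputed from the twenty farm records. In the sketch below the records are those of Table 35 (110 acres for records 16 and 18 is taken from the matching rows of Table 38); the function `group_averages` and its arguments are illustrative.

```python
# (acres, cows, income in dollars) for the 20 farms
farms = [(60, 18, 960), (220, 0, 830), (180, 14, 1260), (80, 6, 610),
         (120, 1, 590), (100, 9, 900), (170, 6, 820), (110, 12, 880),
         (160, 7, 860), (230, 2, 760), (70, 17, 1020), (120, 15, 1080),
         (240, 7, 960), (160, 0, 700), (90, 12, 800), (110, 16, 1130),
         (220, 2, 760), (110, 6, 740), (160, 12, 980), (80, 15, 800)]

def group_averages(records, key_index, width):
    """Group records into classes of the given width on one field and
    return {class lower bound: (count, mean acres, mean cows, mean income)}."""
    groups = {}
    for rec in records:
        lower = (rec[key_index] // width) * width
        groups.setdefault(lower, []).append(rec)
    result = {}
    for lower in sorted(groups):
        rows = groups[lower]
        n = len(rows)
        result[lower] = (n,) + tuple(sum(r[i] for r in rows) / n for i in range(3))
    return result

by_acres = group_averages(farms, 0, 50)   # size groups of 50 acres
by_cows = group_averages(farms, 1, 5)     # herd-size groups of 5 cows
for lower, row in by_acres.items():
    print(lower, row)
```

Grouping on either variable shows the other one drifting in the opposite direction from class to class, which is why neither set of simple averages isolates a single effect.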

If, on the other hand, we should attempt to determine how far 
income varied with differences in the number of cows by classifying 
the records with respect to the number of cows, and averaging incomes, 
we should secure the result shown in Table 37. 


TABLE 37

Average Acres and Income, for Farms with Different Numbers of Cows

Size of herd        Number      Average size of dairy    Average size of farms    Average income
                    of farms    (number of cows)         (number of acres)        (number of dollars)

Under 5 cows        5            1.0                     190                      728
5-9 cows            6            6.8                     143                      815
10-14 cows          4           12.5                     135                      980
15 cows and over    5           16.2                      88                      998


Even though the income is higher on the farms with more cows,
Table 37 does not indicate how much of that can be credited to the
cows and how much to other factors. It is evident from the table
that as the number of cows goes up, the number of acres goes down;
are the differences in income associated with changes in number of
cows, in number of acres, or in part with both?

Eliminating the approximate influence of one variable. What we
need to know is how far income varies with size of farm, as between
farms with the same number of cows; and how far income varies with
the number of cows, as between farms of the same size as to acres.
One way of determining this would be to adjust the income on each
farm to eliminate the differences due to (or associated with) the
number of cows, and then compare the adjusted incomes with the size
of the farm to determine the effect of size on income. To start this
process the effect of the number of cows upon incomes is needed. We
can secure an approximate measure of this by determining the
straight-line equation for estimating incomes from cows; approximate only,
since the differences in the size of the farms are ignored at this point.

TABLE 38

Adjusting Farm Incomes for Differences in Number of Cows

Size of farm         Size of dairy       Income       Income assumed         Income adjusted
                                                      due to cows            to no-cow basis
(number of acres)    (number of cows)    (dollars)    (number of dollars)    (number of dollars)

 60                  18                    960        362                    598
220                   0                    830          0                    830
180                  14                  1,260        282                    978
 80                   6                    610        121                    489
120                   1                    590         20                    570
100                   9                    900        181                    719
170                   6                    820        121                    699
110                  12                    880        241                    639
160                   7                    860        141                    719
230                   2                    760         40                    720
 70                  17                  1,020        342                    678
120                  15                  1,080        302                    778
240                   7                    960        141                    819
160                   0                    700          0                    700
 90                  12                    800        241                    559
110                  16                  1,130        322                    808
220                   2                    760         40                    720
110                   6                    740        121                    619
160                  12                    980        241                    739
 80                  15                    800        302                    498

Determining the straight-line relation according to Chapter 5, we 
find that the relation between cows and income is given by the 
equation: 

Income = $694 + 20.11 (number of cows) 

According to this equation, farms with no cows averaged about
$694 income, and these incomes increased $20.11 for each cow added,
on the average. Knowing this relation, we can adjust the incomes
on the several farms by deducting that part of the income which
would be assumed due to the cows, according to this average relation.
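The straight-line relation between cows and income can be refitted from the farm records. The sketch below uses ordinary least squares, which is assumed here to correspond to the method of Chapter 5; the helper `fit_line` is illustrative.

```python
# Number of cows and income (dollars) for the 20 farms of Table 35
cows = [18, 0, 14, 6, 1, 9, 6, 12, 7, 2, 17, 15, 7, 0, 12, 16, 2, 6, 12, 15]
income = [960, 830, 1260, 610, 590, 900, 820, 880, 860, 760,
          1020, 1080, 960, 700, 800, 1130, 760, 740, 980, 800]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

a, b = fit_line(cows, income)
print(round(a), round(b, 2))

# Income assumed due to cows, and income adjusted to a no-cow basis
due_to_cows = [round(b * c) for c in cows]
adjusted = [y - d for y, d in zip(income, due_to_cows)]
```

The last two lists follow the construction of the fourth and fifth columns of Table 38.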

Table 38 illustrates the process of adjusting the incomes to a
no-cow basis, by subtracting out this approximate effect of cows on
incomes. The next step is to see what the relation is between the
acres in the farm and these adjusted incomes. Plotting both on a
dot chart, Figure 28 shows this relation graphically. Comparing this
figure with Figure 25, where the relation between the acres and the
unadjusted incomes was plotted, we see that the relation is much closer
and more definite for the adjusted incomes than for the unadjusted
incomes. This is only natural; now that the marked relation of
number of cows to income has been removed, even if only approximately,
the underlying relation of size to income can be more clearly seen.

It is evident from Figure 28 that size has a more marked effect
upon income than appeared in Figure 25, where the effect of
cows was mixed in also. As was pointed out earlier, the fact that
cows and acres were correlated meant that the effects of differences
in cows were mixed in with the effects of differences in acres.
Now that the effect of cows has been at least roughly removed,
the change in incomes with changes in acres can be more
accurately determined.

Fig. 28. Relation of income, adjusted for number of cows, to number of acres.

Fitting straight lines to the relations shown in Figures 25 and 28,
to determine the average change in income with changes in acres,
we obtain regression equations as follows:

Income = $868.74 + (number of acres) $0.0234

Income, effect of cows removed, = $508.51 + (number of acres) $1.33

It is evident that the determination of the effect of acres upon 
income without making some allowance for the effect of the correlated 
variable, number of cows, in this case would have seriously under- 
estimated the effect of acres upon income. Such a determination 
would have shown only $0.02 increase in income for each acre increase 
in size, whereas the later determination shows $1.33 increase in income 
for each acre increase in size. 
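The contrast between the two fitted lines can be reproduced from the farm records. The sketch below assumes ordinary least squares and uses the rounded rate of $20.11 per cow from the text; small differences from the printed constants arise from rounding, and the helper `fit_line` is illustrative.

```python
# (acres, cows, income in dollars) for the 20 farms of Table 35
farms = [(60, 18, 960), (220, 0, 830), (180, 14, 1260), (80, 6, 610),
         (120, 1, 590), (100, 9, 900), (170, 6, 820), (110, 12, 880),
         (160, 7, 860), (230, 2, 760), (70, 17, 1020), (120, 15, 1080),
         (240, 7, 960), (160, 0, 700), (90, 12, 800), (110, 16, 1130),
         (220, 2, 760), (110, 6, 740), (160, 12, 980), (80, 15, 800)]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

acres = [f[0] for f in farms]
income = [f[2] for f in farms]
# Income with the approximate effect of cows ($20.11 each) removed
adjusted = [f[2] - 20.11 * f[1] for f in farms]

a_gross, b_gross = fit_line(acres, income)    # unadjusted incomes on acres
a_net, b_net = fit_line(acres, adjusted)      # adjusted incomes on acres
print(round(a_gross, 2), round(b_gross, 4))
print(round(a_net, 1), round(b_net, 2))
```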

The relation now shown between income and acres illustrates the 
extent to which one variable may really influence a second, even 
though its influence is concealed by the presence of a third variable. 
From Figure 25, which indicates that there is practically no correla- 
tion between acres and income, one might conclude that differences 
in income were not at all associated with differences in acreage; yet 
when the variation in income associated with cows is removed, even 
by the rough method shown, a very definite relation of income 
to size is found. For that reason one cannot conclude that, because 
two variables have no correlation, they are not associated with each 
other; the lack of correlation may be due to the compensating influ- 
ence of one or more other variables, concealing the hidden relation. 

Eliminating the approximate influence of both variables. We 
now have two equations, one showing the effect of cows upon income 
and the other the effect of acres : 

(A) Income = $694 + (number of cows) $20.11 

(B) Income, effect of cows removed, 

= $508.51 + (number of acres) $1.33 

These two equations can be combined into a single equation by 
taking that part of the first one which shows the increase in income 
for each cow and adding it to the second one. This gives an equation 
which includes allowances for both factors, as follows: 

(C) Income = $508.51 + (number of acres) $1.33 

+ (number of cows) $20.11 

The last equation gives a basis for indicating the effect of both 
acres and cows on income and for computing the income that might 
be expected, on the average, with a farm of a given size and witli a 
given number of cows. For example, for a farm of 120 acres and 
15 cows, the expected income would work out as follows: 

Income = $508.51 + (120) $1.33 + (15) $20.11 
= $508.51 + $159.60 + $301.65 = $970 


If 5 cows were added, making it 120 acres and 20 cows, the
estimated income would be:

Income = $508.51 + (120) $1.33 + (20) $20.11
       = $1070

Or if 50 acres were added, making 170 acres and 15 cows, the
income would be estimated:

Income = $508.51 + (170) $1.33 + (15) $20.11
       = $1036
TABLE 39

Actual Income and Income Estimated from Number of Acres and Cows

                  Computation of estimated income
Acres    Cows    Estimate for acres    Estimate for cows    Estimated income    Actual     Actual income
                 $1.33 (acres)         $20.11 (cows)        (A) + (C)           income     minus estimated
                 (A)                   (C)                  + $508.51                      income

 60      18      $ 80                  $362                 $  950.5            $  960     $   9.5
220       0       293                     0                    801.5               830        28.5
180      14       239                   282                  1,029.5             1,260       230.5
 80       6       106                   121                    735.5               610      -125.5
120       1       160                    20                    688.5               590       -98.5
100       9       133                   181                    822.5               900        77.5
170       6       226                   121                    855.5               820       -35.5
110      12       146                   241                    895.5               880       -15.5
160       7       213                   141                    862.5               860        -2.5
230       2       306                    40                    854.5               760       -94.5
 70      17        93                   342                    943.5             1,020        76.5
120      15       160                   302                    970.5             1,080       109.5
240       7       319                   141                    968.5               960        -8.5
160       0       213                     0                    721.5               700       -21.5
 90      12       120                   241                    869.5               800       -69.5
110      16       146                   322                    976.5             1,130       153.5
220       2       293                    40                    841.5               760       -81.5
110       6       146                   121                    775.5               740       -35.5
160      12       213                   241                    962.5               980        17.5
 80      15       106                   302                    916.5               800      -116.5


Equation (C) can be used as illustrated, to work out what income 
might be expected, on the average, for each of the farms shown in 
Table 39. The estimated income can then be compared with the 
actual income and the difference, if any, determined. 









As is illustrated in Table 39, the estimated incomes vary somewhat 
from the actual. This is just another way of saying that all the 
differences in income cannot be accounted for by the effect of differ- 
ences in acres and in cows, according to the relations summarized in 
equation (C). This failure of the estimated values to agree exactly 
with the original values is seen graphically in Figure 28 by the fact 
that all the dots do not lie exactly along the regression line. Sub- 
tracting the estimated values from the actual values gives the residual 
differences of the actual income above or below the income estimated 
from the two factors, acres and cows. 

Correcting results by successive elimination. It may now be 
recalled that, even though the incomes were adjusted to eliminate the 
effects of cows upon income before determining the relation between 
income and acres, the determination of the relation between income and 
cows was made without making any allowance for the concurrent effect 
of acres. Since we now have an approximate measure of the effect 
of acres determined while eliminating to some extent the effect of 
cows, we can use that new measure, equation (B), to adjust the in- 
comes for the effect of the acres and then get a more accurate measure 
of the true effect of cows alone upon incomes. This process is shown 
in Table 40. Here estimates of income are worked out by equation 
(B) on the basis of acres, showing what the incomes might be ex- 
pected to average if all the farms had no cows. The difference 
between these estimates and the actual incomes may then be con- 
sidered to be the part due to cows alone, while eliminating the effect 
of differences in the numbers of acres. On the first farm, for example, 
equation (B) indicates that with no cows the income for 60 acres 
should be $588. Subtracting this from the $960 actually received 
leaves $372 as the income apparently accompanying the 18 cows. 
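The adjustment just described for the first farm can be sketched in a few lines of Python. The function names are hypothetical, introduced only for this illustration, and the constants are those of equation (B).

```python
def income_for_acres_no_cows(acres):
    """Equation (B): income expected for a farm of this size with no cows."""
    return 508.51 + 1.33 * acres

def income_adjusted_for_acres(acres, income):
    """Actual income less the income expected from acres alone."""
    return income - income_for_acres_no_cows(acres)

# First farm of Table 40: 60 acres, $960 actual income.
print(round(income_for_acres_no_cows(60)))         # expected with no cows
print(round(income_adjusted_for_acres(60, 960)))   # left over for the cows
```

Applying the second function to every farm in turn gives the last column of Table 40, apart from small differences of rounding.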

The adjusted incomes may then be plotted on a dot chart with the 
number of cows as the other variable, as shown in Figure 29. Com- 
paring this figure with Figure 26, where the number of cows was 
plotted against income without first making any adjustment in the 
original incomes, we easily see how much closer the relation is after 
making the adjustment. Further, it is evident that cows have a 
greater effect upon income than was indicated by the earlier compari- 
son. Computing the straight-line relationship for Figure 29 gives 
the equation: 

(D) Income, adjusted to constant acres, 

= -$68.77 + (number of cows) $27.88




By this last computation (equation [D]), each increase of one cow 
causes an average increase in income of $27.88, whereas according 
to the earlier comparison (equation [A]), each increase of one cow 
caused an average increase in income of only $20.11. The second 
value is larger than the first, again 
showing the necessity of making 
allowances for the effect of one 
factor before the true value of the 
other can be properly measured. 

Now that we have a new meas- 
ure of the effect of cows, we might 
go on to adjust incomes for cows 
by this new measure and then get 
a revised value for the effect of 
acres upon incomes on a no-cow 
basis, in place of the relation 
shown in equation (B). This pos- 
sibility of further correction will 
be referred to later. But before that we will make some experiments 
with the new equation (D). 

Fig. 29. Relation of income, adjusted for number of acres, to number of cows.

We now have equations for the relation of incomes, adjusted for
the other factors, to the remaining factors. These two equations, (B)
and (D), are:

(B) Income, effect of cows removed,

= $508.51 + (number of acres) $1.33

(D) Income, adjusted to constant acres,

= -$68.77 + (number of cows) $27.88


These two equations may be combined to give a revised equation 
to indicate the effect of both cows and acres upon incomes, equa- 
tion (E). 

(E) Income = $439.74 + (number of acres) $1.33 

+ (number of cows) $27.88 


Equation (E) is exactly the same as the previous equation (C) 
except that the revised effect of cows is included, and the constant 
term has also been changed owing to changing the allowance for cows. 

In exactly the same way that equation (C) could be used to work 
out the estimated income for any given combination of cows and
acres, equation (E) can be also used. Thus for 120 acres and 15
cows, it would give 

Estimated income = $439.7 + (120) $1.33 + (15) $27.88
= $439.7 + $159.6 + $418.2 = $1,018

TABLE 40

Adjusting Farm Incomes for Differences in Number of Acres

  Size of  Size of              Income estimated   Income with effects
     farm    dairy    Income   for acres, with            of acreage
  (acres)   (cows) (dollars)           no cows           differences
                                     (dollars)          eliminated *
                                                           (dollars)

       60       18       960               588                   372
      220        0       830               801                    29
      180       14     1,260               748                   512
       80        6       610               615                    -5
      120        1       590               669                   -79
      100        9       900               642                   258
      170        6       820               735                    85
      110       12       880               655                   225
      160        7       860               722                   138
      230        2       760               815                   -55
       70       17     1,020               602                   418
      120       15     1,080               669                   411
      240        7       960               828                   132
      160        0       700               722                   -22
       90       12       800               629                   171
      110       16     1,130               655                   475
      220        2       760               802                   -42
      110        6       740               655                    85
      160       12       980               722                   258
       80       15       800               615                   185

* Where the actual income is below that expected for a farm of that size with no cows, the
deficit is indicated by the minus sign.


The result, $1,018, is $48 higher than the $970 worked out by 
equation (C). This higher estimate is due to the fact that equation 
(E) makes a larger allowance for the effect of each cow, and 15 is 
more than the average number of cows. If less than the average 
number of cows were used, equation (E) would give a lower estimate 
than equation (C). 
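The point at which the two equations give the same estimate can be found directly, since the allowance for acres is the same in both and cancels out. The short Python sketch below is illustrative only; the names `e_minus_c` and `break_even_cows` are ours.

```python
# Equation (C): 508.51 + 1.33 * acres + 20.11 * cows
# Equation (E): 439.74 + 1.33 * acres + 27.88 * cows
def e_minus_c(cows):
    """Estimate by (E) minus estimate by (C); the acreage terms cancel."""
    return (439.74 - 508.51) + (27.88 - 20.11) * cows

# Number of cows at which the two estimates coincide.
break_even_cows = (508.51 - 439.74) / (27.88 - 20.11)

print(round(break_even_cows, 2))
print(round(e_minus_c(15)))
```

The break-even figure comes out at about 8.85 cows, which is the average of the cows column of Table 40 (177 cows on 20 farms); for 15 cows the difference is about $48, as stated above.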











Working out the estimated incomes for each of the original obser- 
vations according to equation (E), we obtain results as shown in 
Table 41. 


TABLE 41

Actual Income and Income Estimated from Number of Acres and Number of
Cows, Revised Relations

(all money columns in dollars)

              Estimate  Estimate   Estimated               Actual
             for acres  for cows      income               income
                 $1.33    $27.88   (A) + (B)    Actual      minus
Acres   Cows   (acres)    (cows)    + $439.7    income  estimated
                   (A)       (B)                           income

   60     18        80       502     1,021.7       960      -61.7
  220      0       293         0       732.7       830       97.3
  180     14       239       390     1,068.7     1,260      191.3
   80      6       106       167       712.7       610     -102.7
  120      1       160        28       627.7       590      -37.7
  100      9       133       251       823.7       900       76.3
  170      6       226       167       832.7       820      -12.7
  110     12       146       335       920.7       880      -40.7
  160      7       213       195       847.7       860       12.3
  230      2       306        56       801.7       760      -41.7
   70     17        93       474     1,006.7     1,020       13.3
  120     15       160       418     1,017.7     1,080       62.3
  240      7       319       195       953.7       960        6.3
  160      0       213         0       652.7       700       47.3
   90     12       120       335       894.7       800      -94.7
  110     16       146       446     1,031.7     1,130       98.3
  220      2       293        56       788.7       760      -28.7
  110      6       146       167       752.7       740      -12.7
  160     12       213       335       987.7       980       -7.7
   80     15       106       418       963.7       800     -163.7


Comparing the residuals, or differences between the actual and 
estimated income, obtained by means of this new equation with those 
obtained using the equation in its first form (shown in Table 39) , we 
see that in more than half the cases they are smaller with the 
revised form. A more definite comparison can be made by comput- 
ing the standard deviation of the residuals in each case. The standard 
deviation of the residuals shown in Table 39, using equation (C),
is $90.29, whereas the standard deviation of the residuals shown in
Table 41, using equation (E), is but $78.70. It is apparent from this 
that the revised equation, determined after the effects of the other 
variables had been eliminated, gives more accurate estimates of income 
than does the original equation in which the effects of the other
variables had not been so fully eliminated.
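The comparison of the two sets of residuals can be repeated with a short Python sketch. It rests on two assumptions of ours: the twenty farms are entered as (acres, cows, income) from Table 40, and the standard deviation is taken about the mean of the residuals with divisor n (`statistics.pstdev`). Since the printed coefficients are rounded, the results should come close to the $90.29 and $78.70 quoted above without necessarily agreeing to the cent.

```python
from statistics import pstdev

# (acres, cows, income in dollars) for the twenty farms of Table 40.
FARMS = [
    (60, 18, 960), (220, 0, 830), (180, 14, 1260), (80, 6, 610),
    (120, 1, 590), (100, 9, 900), (170, 6, 820), (110, 12, 880),
    (160, 7, 860), (230, 2, 760), (70, 17, 1020), (120, 15, 1080),
    (240, 7, 960), (160, 0, 700), (90, 12, 800), (110, 16, 1130),
    (220, 2, 760), (110, 6, 740), (160, 12, 980), (80, 15, 800),
]

def residuals(constant, per_acre, per_cow):
    """Actual income minus estimated income for every farm."""
    return [income - (constant + per_acre * acres + per_cow * cows)
            for acres, cows, income in FARMS]

sd_c = pstdev(residuals(508.51, 1.33, 20.11))   # equation (C)
sd_e = pstdev(residuals(439.74, 1.33, 27.88))   # equation (E)
print(round(sd_c, 1), round(sd_e, 1))
```

The smaller figure for equation (E) is the numerical form of the statement that the revised equation fits the observations more closely.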

It was suggested previously that the last corrected values for the 
relation of cows to income gave a new basis for correcting income 
so as to measure more accurately the relation of acres to income. 
This in turn would give a new basis for measuring the effect of cows, 
and so on, until a final stable value had been reached. So long as 
a new correction would result in a further change in the computed 
effect of either variable, the new values would give a better basis for 
estimating income than did the previous values. Only when the point 
was reached where no further change need be made in the effect of 
either variable could it be said that the relation of each variable to 
income had been quite correctly measured while allowing for the influ- 
ence of the other factor, and that might involve a large number of 
successive corrections. 
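The repeated correction described above can be sketched as a loop that stops when neither allowance changes any further. This is an illustration under our own assumptions, not the author's worksheet: the data are the twenty farms of Table 40, each straight line is fitted by least squares, and the names `slope` and `successive_elimination` are hypothetical.

```python
# (acres, cows, income in dollars) for the twenty farms of Table 40.
FARMS = [
    (60, 18, 960), (220, 0, 830), (180, 14, 1260), (80, 6, 610),
    (120, 1, 590), (100, 9, 900), (170, 6, 820), (110, 12, 880),
    (160, 7, 860), (230, 2, 760), (70, 17, 1020), (120, 15, 1080),
    (240, 7, 960), (160, 0, 700), (90, 12, 800), (110, 16, 1130),
    (220, 2, 760), (110, 6, 740), (160, 12, 980), (80, 15, 800),
]
acres = [f[0] for f in FARMS]
cows = [f[1] for f in FARMS]
income = [f[2] for f in FARMS]

def mean(values):
    return sum(values) / len(values)

def slope(x, y):
    """Least-squares slope of y on a single independent variable x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def successive_elimination(tolerance=1e-10, max_rounds=1000):
    b_acres = 0.0
    b_cows = slope(cows, income)   # first rough measure, as in equation (A)
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Remove the current allowance for cows, then re-measure acres.
        new_b_acres = slope(acres, [y - b_cows * c for y, c in zip(income, cows)])
        # Remove the new allowance for acres, then re-measure cows.
        new_b_cows = slope(cows, [y - new_b_acres * a for y, a in zip(income, acres)])
        change = max(abs(new_b_acres - b_acres), abs(new_b_cows - b_cows))
        b_acres, b_cows = new_b_acres, new_b_cows
        if change < tolerance:
            break
    constant = mean(income) - b_acres * mean(acres) - b_cows * mean(cows)
    return constant, b_acres, b_cows, rounds

constant, b_acres, b_cows, rounds = successive_elimination()
print(rounds, round(constant, 2), round(b_acres, 3), round(b_cows, 2))
```

The first pass through the loop corresponds to equations (B) and (D); the later passes are the further corrections which the text leaves undone.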

This method of allowing for the effect of other factors so as to 
determine the true relation of each one to the dependent factor (as 
income, in this case) , by first correcting for one, and then for another, 
is known as the method of successive elimination. This method can 
be used where there are three or more independent factors related to 
(or accompanying variations in) a dependent (or resultant) factor 
just as it was used here for two factors, except that then the depend- 
ent needs to be corrected in turn to eliminate the effects of all the 
other independent factors except the particular one whose effect is
being measured. But although it is possible to measure the relations by 
this method, it would be a very slow and laborious process. A 
shorter mathematical method which gives the same result by more 
direct processes is available instead. This method, known as the 
method of multiple correlation, is presented in detail in Chapter 12. 

Summary. This chapter has shown that when two related fac- 
tors both affect a third factor it is difficult to measure the effect of 
either factor upon the third without the result being affected by both 
causal factors. Allowing for this duplication by eliminating the 
effects of each factor in turn (successive elimination) can gradually 
determine the true effect of each, but the method is long and laborious. 



CHAPTER 11 


DETERMINING THE WAY ONE VARIABLE CHANGES WHEN 
TWO OR MORE OTHER VARIABLES CHANGE: (2) BY 
CROSS-CLASSIFICATION AND AVERAGES 

We have previously seen (Chapter 4) how the relation between 
two variables can be studied by means of averages. An extension 
of the same method can be used for problems where two or more vari- 
ables affect a third variable, such as that discussed in the last chapter. 

Analysis by averages where there are two independent variables 
involves classifying the records first by one variable, then breaking 
each of the resulting groups into several smaller groups according 
to the values of the second variable. If a third independent variable 
were to be considered, these groups would be broken up into still 
smaller groups, according to the values of the third variable. Then 
the values of the dependent variable, as well as each of the inde- 
pendent variables, would be averaged for each subgroup. This process 
is known as subclassification or cross-classification. 

Cross-classification for three variables. In the problem pre- 
sented in the last chapter, there were two independent variables — 
number of cows and number of acres. The records would therefore 
need to be classified into groups both according to the number of 
cows and the number of acres on each farm. Since there is such a 
small number of records the groups should not be made too small. 
Let us take three groups for cows; less than 6, 6 to 11, and 12 and 
over; and four groups for the size of farm; from 50 to 99 acres, from 
100 to 149, from 150 to 199, and 200 acres and over. This will give 
us twelve possible groups in all. The records may be classified into 
these twelve groups and totals and averages computed for each, as 
shown in detail in Table 42. 
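The sorting and averaging just described can be sketched in Python as follows. The class limits are those stated above; the data are the twenty farms as listed in Table 40, and the function names are ours, chosen only for this illustration.

```python
from collections import defaultdict

# (acres, cows, income in dollars) for the twenty farms.
FARMS = [
    (60, 18, 960), (220, 0, 830), (180, 14, 1260), (80, 6, 610),
    (120, 1, 590), (100, 9, 900), (170, 6, 820), (110, 12, 880),
    (160, 7, 860), (230, 2, 760), (70, 17, 1020), (120, 15, 1080),
    (240, 7, 960), (160, 0, 700), (90, 12, 800), (110, 16, 1130),
    (220, 2, 760), (110, 6, 740), (160, 12, 980), (80, 15, 800),
]

def cow_group(cows):
    if cows < 6:
        return "under 6 cows"
    if cows < 12:
        return "6 to 11 cows"
    return "12 cows and over"

def acre_group(acres):
    if acres < 100:
        return "50 to 99 acres"
    if acres < 150:
        return "100 to 149 acres"
    if acres < 200:
        return "150 to 199 acres"
    return "200 acres and over"

def cross_classify(farms):
    """Sort the records into cells, then average each variable within each cell."""
    cells = defaultdict(list)
    for acres, cows, income in farms:
        cells[(acre_group(acres), cow_group(cows))].append((acres, cows, income))
    summary = {}
    for key, members in cells.items():
        n = len(members)
        summary[key] = {
            "farms": n,
            "avg_acres": sum(m[0] for m in members) / n,
            "avg_cows": sum(m[1] for m in members) / n,
            "avg_income": sum(m[2] for m in members) / n,
        }
    return summary

table = cross_classify(FARMS)
for key in sorted(table):
    print(key, table[key])
```

Only ten of the twelve possible cells are filled, which already suggests how thinly twenty records spread over a two-way classification.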

It is apparent that none of these groups has a sufficient number of 
farms represented to make the averages particularly significant; yet
even at that a certain regularity in the averages can be observed. In 
each column the average income increases as the size of farm increases, 
though there is but little difference in the average number of cows
from group to group; similarly across each line of averages the
income increases as the number of cows increases, though there 
is but little difference in the average size of farm from group to
group. These relations may be more clearly seen in Figures 30 and
31, where the average incomes from Table 42 are charted, first for
differences in the number of cows with farms of similar sizes, and
then for differences in the number of acres, with farms of similar
numbers of cows.

TABLE 42

Cross-classification of Reports According to Size of Farm and Size of Dairy Herd

                       Under 6 cows            6 to 11 cows          12 cows and over
Size of farm        Acres   Cows  Income    Acres   Cows  Income    Acres   Cows  Income
                   Number Number Dollars   Number Number Dollars   Number Number Dollars

50 to 99 acres                                 80      6     610       60     18     960
                                                                       70     17   1,020
                                                                       90     12     800
                                                                       80     15     800
  Total                                        80      6     610      300     62   3,580
  Average                                      80      6     610       75   15.5     895

100 to 149 acres      120      1     590      100      9     900      110     12     880
                                              110      6     740      120     15   1,080
                                                                      110     16   1,130
  Total               120      1     590      210     15   1,640      340     43   3,090
  Average             120      1     590      105    7.5     820      113   14.3   1,030

150 to 199 acres      160      0     700      170      6     820      180     14   1,260
                                              160      7     860      160     12     980
  Total               160      0     700      330     13   1,680      340     26   2,240
  Average             160      0     700      165    6.5     840      170     13   1,120

200 acres and over    220      0     830      240      7     960
                      230      2     760
                      220      2     760
  Total               670      4   2,350      240      7     960
  Average             223    1.3     783      240      7     960


Fig. 30. Difference in average income with difference in number of cows, for farms
grouped by size of farm.

Fig. 31. Difference in average income with difference in number of acres, for farms
grouped by numbers of cows.

Both figures show the tendency for income to increase with an
increase in the independent variable, when the effect of the other
variable is held fairly constant by the grouping process. In Figure
30 the lines show about the same general slope for each of the four
groups, though there are some irregularities. Figure 31 similarly
shows about the same general change in income with a given change
in the size of the farm, no matter what is the number of cows; but
here the irregularities from group to group are even more striking.







In Chapter 4 it was shown that such irregularities from group to
group might readily be due to random errors of sampling. In the 
present case, the number of items in each group is so small that it 
would be hardly worth while to compute the standard error for each 
average. Even if there were many more cases in each group than are 
available here, differences as large as those shown might be due 
simply to random differences in sampling and therefore have no real 
meaning as indicating differences prevailing in the universe from 
which the sample was selected. 

Although the averages obtained by the process of subsorting may 
be considered to show the general effect of changes in one variable, 
such as cows, upon income, with the effect of the other variable, such 
as acres, removed, they cannot be considered to show the specific 
effect of specific differences. For example, much more evidence 
would be needed to prove that, between 75 and 100 acres, a change 
of 1 acre has much greater effect upon income on farms with 6 to 11 
cows than on farms with 12 cows or more, even though the lines in 
Figure 31 would appear to indicate this. All that is really proved 
is that on farms of both numbers of cows there is a tendency for income 
to increase with an increase in the number of acres. 

TABLE 43

Difference in Average Income for Farms of Different Sizes and
With Different Sizes of Dairy Herd

                       Under 6 cows        6 to 11 cows      12 cows or over
                         in herd             in herd             in herd
Size of farm         Size of   Average   Size of   Average   Size of   Average
                       group    income     group    income     group    income
                     (farms) (dollars)   (farms) (dollars)   (farms) (dollars)

50 to 99 acres           ...       ...         1       610         4       895
100 to 149 acres           1       590         2       820         3     1,030
150 to 199 acres           1       700         2       840         2     1,120
200 to 249 acres           3       783         1       960       ...       ...


The averages obtained by the process shown in Table 42 may be 
summarized for publication in a form similar to Table 43. The num- 
ber of cases represented in each average is included to prevent the 
reader from placing an undue amount of confidence in an average
based on a small number of observations. In addition, each should be
followed by ± its own standard error. 

The very small number of cases included in each of the groups 
is strikingly brought out in Table 43. Even if there were five times 
as many farms to deal with — 100 in all — if they were distributed in 
the same manner, the largest group would have only 20 cases, and 
all the rest would have 15 or less, which, under ordinary conditions, 
would be hardly enough for really significant averages. 

Average differences between matched sub-groups. After the ob- 
servations have been grouped and averaged as shown in Table 43, 
average differences in the dependent variable (as here, dollars of 
income), with given differences in each independent variable, can be 
roughly determined while holding constant the other independent 
variable or variables. This involves determining the average differences 
between the averages for the dependent variable for matched groups. 
The computations are shown in Tables 43.1 and 43.2. 

TABLE 43.1

Change in Average Income between Groups Matched for Size of Farm

                                   A         B         C         D         E
Size of farm                   Under      6 to  Increase      Over  Increase
(acres)                       6 cows   11 cows     (B-A)   12 cows     (D-B)
                           (dollars) (dollars) (dollars) (dollars) (dollars)

50-99                            ...       610       ...       895       285
100-149                          590       820       230     1,030       210
150-199                          700       840       140     1,120       280
200-249                          783       960       177       ...       ...

Average change with cows                             182                 258


From these results it appears that increasing the number of cows 
from under 6 to between 6 and 11, without changing the size of farm, 
was accompanied by an average increase of $182. Increasing the 
cows further to over 12 cows was accompanied by a further increase 
of income of $258. Similarly, increasing the size of farm from under
99 acres to 100-149 acres, without changing the number of cows, was
accompanied by an increase of $173 in income. A further increase to
150-199 acres was accompanied by a further average increase of $73
in income, and to 200-249 acres, by $102 more income. (In this
discussion "increase" in size or cows has been used to designate
differences between results for farms of different sizes or with different
number of cows.) These rough measurements of differences in the de- 
pendent variable with differences in one independent variable, while 
holding a second independent constant by subsorting, may be compared 
with results obtained by the more exact methods set forth in subsequent
chapters.¹

This same method may be applied to get the average difference 
between matched subgroups, where two or more other independent 
variables are held constant by the grouping. 
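The averaging of differences between matched groups can be sketched in Python. The average incomes are copied from Table 43, with `None` standing for an empty cell; the helper name `average_change` is ours. As the footnote to this discussion notes for the printed tables, no weighting by the number of farms in each group is attempted.

```python
# Average incomes from Table 43, in dollars. Each row is one size-of-farm
# group; the three entries are: under 6 cows, 6 to 11 cows, 12 cows or over.
AVERAGE_INCOME = {
    "50-99": [None, 610, 895],
    "100-149": [590, 820, 1030],
    "150-199": [700, 840, 1120],
    "200-249": [783, 960, None],
}

def average_change(pairs):
    """Mean of (later - earlier) over the pairs in which both cells are filled."""
    diffs = [b - a for a, b in pairs if a is not None and b is not None]
    return sum(diffs) / len(diffs)

rows = list(AVERAGE_INCOME.values())

# Change with cows, size of farm held constant (Table 43.1).
change_6_to_11 = average_change([(r[0], r[1]) for r in rows])
change_12_plus = average_change([(r[1], r[2]) for r in rows])

# Change with acres, number of cows held constant (Table 43.2).
change_acres = [average_change(list(zip(rows[i], rows[i + 1]))) for i in range(3)]

print(round(change_6_to_11), round(change_12_plus))
print([round(c, 1) for c in change_acres])
```

Only the pairs in which both groups contain farms enter each average, which is why the first and last size groups contribute to fewer comparisons than the middle ones.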


TABLE 43.2

Change in Average Income between Groups Matched for Number of Cows

                           A         B         C         D         E         F         G
Number of cows         50-99   100-149  Increase   150-199  Increase   200-249  Increase
                       acres     acres     (B-A)     acres     (D-B)     acres     (F-D)
                   (dollars) (dollars) (dollars) (dollars) (dollars) (dollars) (dollars)

Under 6                  ...       590       ...       700       110       783        83
6 to 11                  610       820       210       840        20       960       120
12 or over               895     1,030       135     1,120        90       ...       ...

Average change
with acres                                   173                  73                 102

¹ In computing Tables 43.1 and 43.2, no attention was paid to weighting the
results according to the number of cases falling in each group, or to the sampling
reliability of each average. For a discussion of the first of these points, and for
possible methods of dealing with it, see F. A. Harper, Analyzing data for
relationships, Cornell University Agricultural Experiment Station Memoir 231, June, 1940.

Limitation of cross-classification for many variables. This small
problem illustrates one fundamental difficulty with the method of
subclassification and averaging: the large number of cases required
for conclusive results. Though there are only two independent
variables involved, and the records are classified into only three groups
one way and four the other, apparently 100 cases or more would be
required for really significant results. If it had been desired to
subclassify the records according to two additional variables (say
number of men employed and number of hogs kept), that would have
greatly increased the number of records necessary. If each of the
groups already shown had been further divided into 1-man, 2-man,
and 3-or-more-man farms, and each of these sub-groups had been
further divided into farms with less than 20 hogs, 20 to 39 hogs, and
40 or more hogs, that would have increased the number of possible
groups from 12 to 108. Where over 100 records would have been
needed in the first case to give results at all reliable, probably a
thousand or more records would be needed with this further classification.
Although such large numbers of records are available in some types
of work, as in census tabulations, they are rarely obtainable in most
economic or social-science studies, and for that reason treatment of a
large number of variables by the method of detailed sub-classification
has but limited application in this field.

The way in which a fourfold classification, such as that described
in the preceding paragraph, might be presented is indicated by the
form in Table 44, even though it would only occasionally be used.

TABLE 44

Form for Showing Differences in Average Income for Farms Classified
by Acres, Men Employed, Cows, and Hogs

                                      1 man               2 men               3 men
                                 Size *  Average     Size *  Average     Size *  Average
                                         income              income              income

Under 6 cows
  Farms of 50 to 99 acres:
    Under 20 hogs
    20-39 hogs
    40 hogs and over
  Farms of 100 to 149 acres:
    Under 20 hogs
    20-39 hogs
    40 hogs and over

6 to 11 cows
  Farms of 50 to 99 acres:
    Under 20 hogs
    20-39 hogs
    40 hogs and over
  Farms of 100 to 149 acres:
    Under 20 hogs
    20-39 hogs
    40 hogs and over

12 cows and over
  Farms of 50 to 99 acres:
    Under 20 hogs
    20-39 hogs
    40 hogs and over
  Farms of 100 to 149 acres:
    Under 20 hogs
    20-39 hogs
    40 hogs and over

Etc.

* Number of reports in group.

In addition to the large number of cases required to obtain reliable 
results, the method of sub-classification and averaging has further 
shortcomings; it provides no measure of how important the relation 
shown is as a cause of variation in the factor being studied, or of how 
closely that factor may be estimated from the others on the basis of 
the relations shown. Thus Table 43 shows that, on the average, certain 
differences in the number of cows and in the number of acres were 
accompanied by certain differences in the average income. By itself, 
however, it did not give any indication of how closely the income
could be estimated if the number of acres or the number of cows 
were known; nor did it indicate the proportion of the variance in 
income which can be explained by concurrent differences in size of 
farm and size of dairy. For these reasons, as well as because of 
the large number of cases necessary to obtain reliable conclusions, 
the method of sub-classification and averaging does not determine 
the relationships where many variables are involved so satisfactorily
as do other methods, which will be considered in subsequent
chapters. 

Significance of differences in group averages. When the data are 
classified as shown above, the results may be tested to determine 
whether the differences found between successive group averages are 
significant, or whether they might have occurred by chance. One
method for testing this is to compute the standard error for each group
average and to consider these standard errors in judging whether or
not the differences are significant.² A second method of judging the
significance of the differences is by determining whether the variation
between the averages of the columns or cells is or is not significant, as
compared to the variation between the individual items which fall in
each column or cell. Relatively simple methods, set forth in standard
textbooks,³ are available for this "analysis of variance." Since these
methods relate only to the significance of the observed differences, 
and not to the functional nature of the relations which underlie those 
differences, they are not presented here. 

Summary. The relation of one variable to several others may be 
approximately determined by detailed cross-classification. Very large 
numbers of records are required to make the averages accurate, how- 
ever, since the number of groups increases rapidly with additional 
variables. Further, the averages by themselves give no indication of 
the closeness of correlation. 

² Formulas for the standard errors of the difference between two group averages
are given by G. Udny Yule and M. G. Kendall in their Introduction to the Theory
of Statistics (eleventh edition), pp. 387-88, C. Griffin and Co., Ltd., London, 1937.

³ Frederick E. Croxton and Dudley J. Cowden, Applied General Statistics,
pp. 351-59, Prentice-Hall, Inc., New York, 1939.

R. A. Fisher, Statistical Methods for Research Workers (seventh edition),
Chapter VIII, Oliver and Boyd, London and Edinburgh, 1938.

G. W. Snedecor, Statistical Methods Applied to Experiments in Agriculture and
Biology, Chapters 10, 11, Iowa State College Press, Ames, Iowa, 1937.



CHAPTER 12 


DETERMINING THE WAY ONE VARIABLE CHANGES WHEN 
TWO OR MORE VARIABLES CHANGE: (3) BY USING 
A LINEAR REGRESSION EQUATION 

In Chapter 10 it was shown that an equation could be arrived at to 
express the average relation between income, acres, and cows, as fol- 
lows: 

Equation (E) 

Income = 439.74 + 1.33 (number of acres) + 27.88 (number of cows) 

If we designate the three series of variable quantities, income,
acres, and cows, by the symbol X with different subscripts, using \(X_1\) to
represent dollars of income, \(X_2\) to represent number of acres, and \(X_3\) to
represent the number of cows, we can rewrite the equation in the form

\[ X_1 = 439.74 + 1.33 X_2 + 27.88 X_3 \]

If now we use the symbol \(a\) to represent the constant quantity
439.74; \(b_2\) to represent 1.33, the amount which \(X_1\) increases for each
increase of one unit in \(X_2\) (one acre); and \(b_3\) to represent 27.88, the
amount which \(X_1\) increases for each increase of one unit in \(X_3\) (one
cow); the equation appears as

\[ X_1 = a + b_2 X_2 + b_3 X_3 \tag{30} \]

Comparing this equation with the regression equation for the
straight-line relation between two variables

\[ Y = a + bX \]

we see that the two equations are just alike, except for the difference
in the symbols used to represent the different variables and for our
having added the expression for an additional variable. In equation
(30), \(X_1\), the variable which is being estimated, is termed the
dependent variable, since its estimated value depends upon those of
the other variable or variables; and \(X_2\) and \(X_3\) are termed independent
variables, since their values are taken just as observed, independent
of any of the conditions of the problem. Since there is more than one
independent variable concerned, the equation is said to be a multiple
estimating equation, or a multiple linear regression equation.

Chapter 10 showed that the values of the constants \(a\), \(b_2\), and \(b_3\),
which in the particular problem considered indicate what the average
income would be for a farm and dairy of any given size, could be
worked out by a cut-and-try method which gradually approached
nearer and nearer to the right values. It is evident, however, that for
any particular criterion of "rightness" only one set of values for
these constants can be exactly right. If the criterion of "rightness"
is taken as that which will make the standard deviation of the residuals,
when income is estimated from the other two variables, as small as
possible, the values of \(a\), \(b_2\), and \(b_3\) which will give this result can
be determined once and for all by a direct mathematical process.
Determining these values so as to give the "best" equation for
estimating \(X_1\) on the basis of linear relations to \(X_2\) and \(X_3\) is the first step
in the method of linear multiple correlation.

Determining a regression equation for two independent variables.
The best values for \(a\), \(b_2\), and \(b_3\) in the multiple regression
equation (30) can be worked out by an extension of the same process
used in working out the values for the estimating equation when only
one independent variable was considered. Just as before, the values of
the \(b\) constants will be determined first, equation (31), and then the
value of \(a\) will be worked out from them:¹

\[
\begin{aligned}
\Sigma(x_2^2)\, b_2 + \Sigma(x_2 x_3)\, b_3 &= \Sigma(x_1 x_2) \\
\Sigma(x_2 x_3)\, b_2 + \Sigma(x_3^2)\, b_3 &= \Sigma(x_1 x_3)
\end{aligned}
\tag{31}
\]

\[ a = M_1 - b_2 M_2 - b_3 M_3 \tag{32} \]

Here, just as in Chapter 5, the symbol M represents the mean value 
of each variable, and the subscript indicates the particular variable. 
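
For readers who want the reasoning behind equations (31) and (32) without turning to the appendix, the following is a brief sketch under the least-squares criterion stated above (making the squared residuals, and hence their standard deviation, as small as possible). The symbol \(S\) for the sum of squared residuals is introduced here only for illustration; it is not the notation of the text, and the sketch is not a substitute for the derivation in Note 6.

```latex
% Lowercase x denotes a deviation from the mean, and z is the residual.
z = x_1 - b_2 x_2 - b_3 x_3, \qquad
S = \Sigma z^2 = \Sigma (x_1 - b_2 x_2 - b_3 x_3)^2

% S is smallest where both partial derivatives vanish.
\frac{\partial S}{\partial b_2} = -2\,\Sigma\, x_2 (x_1 - b_2 x_2 - b_3 x_3) = 0
\;\Longrightarrow\;
\Sigma(x_2^2)\, b_2 + \Sigma(x_2 x_3)\, b_3 = \Sigma(x_1 x_2)

\frac{\partial S}{\partial b_3} = -2\,\Sigma\, x_3 (x_1 - b_2 x_2 - b_3 x_3) = 0
\;\Longrightarrow\;
\Sigma(x_2 x_3)\, b_2 + \Sigma(x_3^2)\, b_3 = \Sigma(x_1 x_3)

% Requiring the residuals to average zero fixes the constant term.
a = M_1 - b_2 M_2 - b_3 M_3
```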

Similarly, the symbols \(\Sigma(x_2x_3)\), \(\Sigma(x_1x_2)\), and \(\Sigma(x_1x_3)\) represent
the sums of the products of the variables, corrected to adjust them to
deviations from the mean; that is, \(\Sigma(x_1x_2) = \Sigma[(X_1 - M_1)(X_2 - M_2)]\).
Likewise the symbols \(\Sigma(x_2^2)\), etc., represent the sums of the
squares of the variables, also adjusted to deviations from the mean.

¹ See Note 6, Appendix 2, for the derivations of these equations. They are the
normal equations for two independent variables, corresponding to the normal
equations for one independent variable given on page 67, in the footnote.






MULTIPLE LINEAR REGRESSION 


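Using the two basic formulas

\[
\Sigma(x_1x_2) = \Sigma(X_1X_2) - nM_1M_2 \tag{11}
\]

and

\[
\Sigma(x_2^2) = \Sigma(X_2^2) - n(M_2^2)
\]

the other values shown in equation (31) may be worked out as follows:

\[
\begin{aligned}
\Sigma(x_1x_3) &= \Sigma(X_1X_3) - nM_1M_3 \\
\Sigma(x_2x_3) &= \Sigma(X_2X_3) - nM_2M_3 \\
\Sigma(x_3^2) &= \Sigma(X_3^2) - n(M_3^2)
\end{aligned}
\]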

Computing the extensions. Inspection of these equations shows that
there are eight arithmetic values which must be computed from the
original data to work out the values to substitute in equations (31) and
(32). These are \(\Sigma X_1\), \(\Sigma X_2\), \(\Sigma X_3\), \(\Sigma(X_2^2)\), \(\Sigma(X_3^2)\), \(\Sigma(X_1X_2)\),
\(\Sigma(X_1X_3)\), and \(\Sigma(X_2X_3)\). The actual work of computing these values
for the farm-income data originally presented in Table 35 is shown in
Table 45. [The value \(\Sigma(X_1^2)\) is not needed in solving equations (31)
or (32); but, as it will be needed later, it is also worked out here for
convenience in calculation.]

After we have multiplied through all the extensions shown in this
table, and added each of the columns, our next step is to compute the
values \(M_1\), \(M_2\), and \(M_3\) by dividing the sums of each of the first three
columns by the number of cases. The correction value for each of the
products is then computed and entered below the value from which it is
to be subtracted. Thus the value below the sum of the fourth column,
\(\Sigma(X_2^2)\), is its correction factor, \(n(M_2^2)\). This is equal to \(20(13.95)^2\),
or 3892.05, which is the value entered. Similarly, the value below the
sum of the fifth column, \(\Sigma(X_2X_3)\), is its correction factor \(n(M_2M_3)\),
or 20(13.95)(8.85), which equals 2469.15. All the other correction
factors are similarly worked out and entered. Then subtracting each
correction factor from the value above it gives the values all ready for
equations (31). Thus the value at the foot of column 4 is the value for
\(\Sigma(x_2^2)\); and so on. When these values are substituted in the
appropriate spaces of equations (31), they become

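\[
\begin{aligned}
\text{(I)}\quad & \Sigma(x_2^2)\,b_2 + \Sigma(x_2x_3)\,b_3 = \Sigma(x_1x_2): & 606.95\,b_2 - 394.15\,b_3 &= 14.20 \\
\text{(II)}\quad & \Sigma(x_2x_3)\,b_2 + \Sigma(x_3^2)\,b_3 = \Sigma(x_1x_3): & -394.15\,b_2 + 676.55\,b_3 &= 1360.60
\end{aligned}
\]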
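
The extensions and corrections just described can be checked by machine. The following is a minimal Python sketch, not part of the original text, which recomputes the corrected sums entering equations (I) and (II) from the 20 farm records as transcribed from Table 45 (acres and income coded by dividing by 10). The function name `corrected_sum` and the list names are illustrative choices only.

```python
# Farm records as transcribed from Table 45 (acres and income divided by 10).
X2 = [6, 22, 18, 8, 12, 10, 17, 11, 16, 23,
      7, 12, 24, 16, 9, 11, 22, 11, 16, 8]        # acres / 10
X3 = [18, 0, 14, 6, 1, 9, 6, 12, 7, 2,
      17, 15, 7, 0, 12, 16, 2, 6, 12, 15]         # cows
X1 = [96, 83, 126, 61, 59, 90, 82, 88, 86, 76,
      102, 108, 96, 70, 80, 113, 76, 74, 98, 80]  # income / 10


def corrected_sum(u, v):
    """Sum of products about the means: S(UV) - n * Mu * Mv, as in formula (11)."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    raw = sum(a * b for a, b in zip(u, v))
    return raw - n * mean_u * mean_v


s22 = corrected_sum(X2, X2)  # corrected sum of squares of acres
s23 = corrected_sum(X2, X3)  # corrected sum of products, acres and cows
s33 = corrected_sum(X3, X3)  # corrected sum of squares of cows
s12 = corrected_sum(X1, X2)  # corrected sum of products, income and acres
s13 = corrected_sum(X1, X3)  # corrected sum of products, income and cows

print(f"(I)  {s22:.2f} b2 + ({s23:.2f}) b3 = {s12:.2f}")
print(f"(II) ({s23:.2f}) b2 + {s33:.2f} b3 = {s13:.2f}")
```

Passing the same list twice gives a corrected sum of squares, so one function covers every entry needed for equations (31).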

Solving the equations. The next step is to solve the two algebraic
equations simultaneously to determine the values for \(b_2\) and \(b_3\).





The simplest way to carry this through is by the Doolittle method.
The first equation is divided through by the coefficient of \(b_2\) with the
sign changed, giving the first derived equation (I'):

\[
\begin{aligned}
\text{(I)}\quad & 606.95\,b_2 - 394.15\,b_3 = 14.20 \\
\text{(I')}\quad & -b_2 + 0.64939\,b_3 = -0.02340
\end{aligned}
\]


TABLE 45

Computation of Values to Determine Multiple Regression Equation
to Estimate One Variable from Two Others

                     (1)    (2)      (3)      (4)      (5)       (6)      (7)       (8)        (9)
                  Acres*   Cows  Income*
                      X2     X3       X1     X2^2     X2X3      X1X2     X3^2      X1X3       X1^2

                       6     18       96       36      108       576      324     1,728      9,216
                      22      0       83      484        0     1,826        0         0      6,889
                      18     14      126      324      252     2,268      196     1,764     15,876
                       8      6       61       64       48       488       36       366      3,721
                      12      1       59      144       12       708        1        59      3,481
                      10      9       90      100       90       900       81       810      8,100
                      17      6       82      289      102     1,394       36       492      6,724
                      11     12       88      121      132       968      144     1,056      7,744
                      16      7       86      256      112     1,376       49       602      7,396
                      23      2       76      529       46     1,748        4       152      5,776
                       7     17      102       49      119       714      289     1,734     10,404
                      12     15      108      144      180     1,296      225     1,620     11,664
                      24      7       96      576      168     2,304       49       672      9,216
                      16      0       70      256        0     1,120        0         0      4,900
                       9     12       80       81      108       720      144       960      6,400
                      11     16      113      121      176     1,243      256     1,808     12,769
                      22      2       76      484       44     1,672        4       152      5,776
                      11      6       74      121       66       814       36       444      5,476
                      16     12       98      256      192     1,568      144     1,176      9,604
                       8     15       80       64      120       640      225     1,200      6,400

Sums                 279    177    1,744    4,499    2,075    24,343    2,243    16,795    157,532
Means              13.95   8.85     87.2
Correction item                           3,892.05 2,469.15 24,328.80 1,566.45 15,434.40 152,076.80
Corrected sums                              606.95  -394.15     14.20   676.55  1,360.60   5,455.20

* In these computations, X2 and X1 have been divided by 10. (See Note 3, Appendix 2.)


Then equation (II) is entered, and under it is written equation (I)
multiplied by the coefficient of \(b_3\) in equation (I') (0.64939). The sum
of these two equations is then taken, eliminating the values in \(b_2\):














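\[
\begin{aligned}
\text{(II)}\quad & -394.15\,b_2 + 676.55\,b_3 = 1360.60 \\
(0.64939)\,\text{(I)}\quad & +394.15\,b_2 - 255.96\,b_3 = 9.22 \\
(\Sigma\text{II})\quad & 420.59\,b_3 = 1369.82 \\
\text{(II')}\quad & b_3 = 3.25690
\end{aligned}
\]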


As indicated above, this step gives the value of \(b_3\). This is then
substituted in equation (I') and the value of \(b_2\) determined:

\[
\begin{aligned}
-b_2 + 0.64939(3.25690) &= -0.02340 \\
b_2 = 0.02340 + 2.11500 &= 2.13840
\end{aligned}
\]

The values of \(b_2\) and \(b_3\) being thus obtained, the next step is to
substitute them, together with the other values required, in equation (32)
to work out the value for \(a\):

\[
\begin{aligned}
a &= M_1 - b_2M_2 - b_3M_3 \\
  &= 87.2 - (2.1384)(13.95) - (3.2569)(8.85) \\
  &= 87.2 - 29.83 - 28.82 = 28.55
\end{aligned}
\]
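
The two-equation solution above can be retraced step by step. The sketch below, in Python, follows the same order of operations (derive (I'), add the multiple of (I) to (II), then substitute back), using the corrected sums and means from Table 45. The variable names are illustrative assumptions, and the last decimal places may differ slightly from the hand computation because no intermediate rounding is applied.

```python
# Corrected sums (coded units) and means, as given in equations (I), (II) and Table 45.
s22, s23, s33 = 606.95, -394.15, 676.55
s12, s13 = 14.20, 1360.60
m1, m2, m3 = 87.2, 13.95, 8.85

# (I'): divide (I) by the coefficient of b2 with the sign changed.
c3 = -s23 / s22   # coefficient of b3 in (I'), about 0.64939
k1 = -s12 / s22   # right-hand side of (I'), about -0.02340

# Add c3 times (I) to (II); the b2 terms cancel.
sum_b3 = s33 + c3 * s23    # about 420.59
sum_rhs = s13 + c3 * s12   # about 1369.82

b3 = sum_rhs / sum_b3
# (I') reads -b2 + c3*b3 = k1, so b2 = c3*b3 - k1.
b2 = c3 * b3 - k1
# Equation (32) for the constant term.
a = m1 - b2 * m2 - b3 * m3

print(f"b2 = {b2:.4f}, b3 = {b3:.4f}, a = {a:.2f}")
```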

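Estimating \(X_1\) from \(X_2\) and \(X_3\). Having computed the values for
\(a\), \(b_2\), and \(b_3\), we can now write out our regression equation (30), with
the best values, as determined by the mathematical calculation:

\[
\left(\frac{X_1}{10}\right) = 28.55 + 2.1384\left(\frac{X_2}{10}\right) + 3.2569\,X_3
\]

\[
X_1 = 285.5 + 2.1384\,X_2 + 32.569\,X_3
\]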

Comparing this equation with the last one obtained in Chapter 10
(page 178), we see that the mathematical determination has changed
the $1.33 allowed for the effect of each acre (\(b_2\)) to $2.14, and
increased the $27.88 allowed for the effect of each cow (\(b_3\)) to $32.57.
Just what effect this has on the accuracy of the equation as a basis
for estimating income from cows and acres may be judged by working
out an estimated income for each of the 20 cases according to these
last results, and then comparing the estimated values with the original
values, just as was done before with the equations worked out by the
approximation method. The necessary computation is shown in Table 46.

The operations that have been performed in this table may be 
mathematically stated as follows: 

First, an estimated value of income, \(X_1'\), has been worked out by
substituting in equation (30) the values for \(X_2\) and \(X_3\) given by each





successive observation. Using the symbol \(X_1'\) to represent this
estimated value of \(X_1\), it may be defined

\[
X_1' = a + b_2X_2 + b_3X_3 \tag{33}
\]

Each estimated income has next been subtracted from the cor- 
responding actual income. With the symbol z used to represent the 
residual, the amount by which the actual value exceeds or falls below 
the estimated value, it may be defined 

\[
z = X_1 - X_1' \tag{34}
\]

The residual z has exactly the same meaning when the estimated 
values of the dependent variable are based upon two or more vari- 
ables, using multiple correlation, as it had previously when the esti- 
mate was based on a single variable, with simple correlation. 

The accuracy of the last estimating equation, derived by an exact 
mathematical process, can now be compared with the accuracy of 
previous equations, obtained by a cut-and-try process. Computing 
the standard deviation of the residuals shown in this last table and 
comparing it with the standard deviations of the residuals worked 
out in Tables 39 and 41 of Chapter 10, we find the comparison to be: 

Standard deviations of residuals using various straight-line equations:

First approximation equation, \(\sigma_z\) = 90.29

Second approximation equation, \(\sigma_z\) = 78.70

Mathematically determined equation, \(\sigma_z\) = 70.48
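
The last of these figures can be reproduced directly from definitions (33) and (34). The following Python sketch applies the mathematically determined equation to the 20 farms, in original units (acres and dollars), and takes the standard deviation of the residuals. It assumes the divisor is the number of cases \(n\), since that choice reproduces the figure quoted above; the function and list names are illustrative.

```python
import math

# The 20 farms in original units, as transcribed from Tables 45 and 46.
ACRES = [60, 220, 180, 80, 120, 100, 170, 110, 160, 230,
         70, 120, 240, 160, 90, 110, 220, 110, 160, 80]
COWS = [18, 0, 14, 6, 1, 9, 6, 12, 7, 2,
        17, 15, 7, 0, 12, 16, 2, 6, 12, 15]
INCOME = [960, 830, 1260, 610, 590, 900, 820, 880, 860, 760,
          1020, 1080, 960, 700, 800, 1130, 760, 740, 980, 800]


def estimated_income(acres, cows):
    """Equation (33) with the constants found above, in dollars."""
    return 285.5 + 2.1384 * acres + 32.569 * cows


# Equation (34): residual = actual minus estimated.
residuals = [y - estimated_income(a, c) for a, c, y in zip(ACRES, COWS, INCOME)]

n = len(residuals)
mean_z = sum(residuals) / n
# Standard deviation of the residuals, dividing by n (an assumption, see above).
sigma_z = math.sqrt(sum((z - mean_z) ** 2 for z in residuals) / n)

print(f"sigma_z = {sigma_z:.2f}")  # about 70.48
```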

The equation determined mathematically gives a closer estimate 
of the actual incomes from which it was derived than do either 
of the two previous equations. This will always hold true. The
mathematically determined equation gives once and for all the estimates of
\(X_1\) which will make \(\sigma_z\) the smallest that can be obtained, assuming
linear relations. The best that could be done by the approximation
method would be to obtain the same conclusions as would be obtained 
by the other method. The successive steps in Chapter 10 have shown 
how difficult it is to do this when the several independent variables are 
correlated with each other, and so tend to vary with one another. The 
mathematical method for determining the estimating equation, as illus- 
trated in this Chapter (or some alternative form of computation involv- 
ing the same principle), has therefore been practically universally 





adopted as the standard way of determining the precise way in which 
one variable is related to, or may be estimated from, two or more vari- 
ables related among themselves, if only straight-line relations are to 
be assumed. 
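
The claim that no other straight-line equation can give a smaller \(\sigma_z\) can be illustrated numerically. The sketch below, in Python, expresses the standard deviation of the residuals as a function of the two slopes, using the corrected sums of Table 45 in coded units, and compares the mathematically determined slopes with the slopes quoted from Chapter 10 ($1.33 per acre and $27.88 per cow, which become 1.33 and 2.788 in coded units) and with a few arbitrary nearby values. The function name and the perturbation sizes are assumptions made for illustration; the figure printed for the Chapter 10 slopes need not agree exactly with the 78.70 reported above, since the details of that earlier computation are not reproduced here.

```python
import math

# Corrected sums of squares and products (coded units), from Table 45.
S11, S22, S33 = 5455.20, 606.95, 676.55
S12, S13, S23 = 14.20, 1360.60, -394.15
N = 20


def sigma_z(b2, b3):
    """Standard deviation of residuals about their mean for the given slopes.

    The constant term does not affect this quantity, so only slopes are needed.
    """
    ss = (S11 - 2 * b2 * S12 - 2 * b3 * S13
          + b2 * b2 * S22 + 2 * b2 * b3 * S23 + b3 * b3 * S33)
    return math.sqrt(ss / N)


best = sigma_z(2.1384, 3.2569)        # mathematically determined slopes
chapter_10 = sigma_z(1.33, 2.788)     # slopes quoted from the cut-and-try equation

# Arbitrary small changes in either slope, to show that each one does worse.
nearby = [sigma_z(2.1384 + d2, 3.2569 + d3)
          for d2, d3 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]]

# Multiply by 10 to return from coded units to dollars.
print(f"least squares: {10 * best:.2f}")
print(f"Chapter 10 slopes: {10 * chapter_10:.2f}")
print("nearby slopes:", [round(10 * s, 2) for s in nearby])
```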


TABLE 46

Actual Income and Income Estimated from Number of Acres and Cows,
on Basis of Mathematically Determined Relations

                    Computation of estimated incomes
                  -------------------------------------                        Actual minus
 Acres,   Cows,   Estimated    Estimated    Constant,   Estimated    Actual    estimated
   X2       X3    for acres,   for cows,        a        income,     income,   income,
                    b2X2         b3X3                      X1'          X1      X1 - X1' = z

    60      18       128          586          286        1,000         960        -40
   220       0       470            0          286          756         830         74
   180      14       385          456          286        1,127       1,260        133
    80       6       171          195          286          652         610        -42
   120       1       257           33          286          576         590         14
   100       9       214          293          286          793         900        107
   170       6       363          195          286          844         820        -24
   110      12       235          391          286          912         880        -32
   160       7       342          228          286          856         860          4
   230       2       492           65          286          843         760        -83
    70      17       150          554          286          990       1,020         30
   120      15       257          489          286        1,032       1,080         48
   240       7       513          228          286        1,027         960        -67
   160       0       342            0          286          628         700         72
    90      12       192          391          286          869         800        -69
   110      16       235          521          286        1,042       1,130         88
   220       2       470           65          286          821         760        -61
   110       6       235          195          286          716         740         24
   160      12       342          391          286        1,019         980        -39
    80      15       171          489          286          946         800       -146

Nomenclature in multiple linear correlation. When the constants
of the estimating equation are determined by the exact mathematical
process, the equation is called a multiple regression equation, and the
constants \(b_2\) and \(b_3\), which show, in this case, the average increase
in income (\(X_1\)) for unit increases in acres (\(X_2\)) and cows (\(X_3\)), are










termed net regression coefficients. The constant \(b_2\) is termed “the net
regression of \(X_1\) on \(X_2\), holding \(X_3\) constant,” and \(b_3\) is termed “the
net regression of \(X_1\) on \(X_3\), holding \(X_2\) constant.” All that that means
for \(b_2\), for example, is “the average change observed in \(X_1\) with unit
changes in \(X_2\), determined while simultaneously eliminating from \(X_1\)
any variation accompanying (hence temporarily assumed due to)
changes in \(X_3\).” ²

In order that the mathematical notation for the net regression co- 
efficients may show quite clearly which independent variables were held 
constant when a particular coefficient was determined, the subscripts 
under the b are sometimes more elaborate, showing first the dependent 
variable, then the independent variable whose effect is stated, then a 
period followed by the independent variables which were held constant 
in the process. Thus the \(b_2\) we have been using would be written
\(b_{12.3}\). The whole regression equation would appear

\[
X_1 = a_{1.23} + b_{12.3}X_2 + b_{13.2}X_3 \tag{35}
\]

This notation serves to distinguish these net regression coefficients 
from those which would be obtained if additional independent variables 
were included. Thus if a third independent variable, say X4, were also 
considered, the equation would read 

\[
X_1 = a_{1.234} + b_{12.34}X_2 + b_{13.24}X_3 + b_{14.23}X_4 \tag{36}
\]

For still another variable it would be

\[
X_1 = a_{1.2345} + b_{12.345}X_2 + b_{13.245}X_3 + b_{14.235}X_4 + b_{15.234}X_5 \tag{37}
\]

The notation for \(a\) is changed as well as for each of the \(b\)'s; \(a_{1.234}\)
will probably be a different value from \(a_{1.23}\), just as \(b_{12.34}\) is likely to
be somewhat different from \(b_{12.3}\). This is to be expected; if some
other factor, such as the number of men working on each farm, were 
taken into account as well as the number of acres and the number of 
cows, the average increase in income per additional acre, with both the 
number of cows and the number of men held constant, might be quite 
different from what it would be with only the number of cows held con- 
stant. In the last case, any increase in income owing to more men 
being at work on the larger number of acres would be ascribed to the 
acres and not to the men, whereas in the former this element would be
removed from the increase attributed to the acres. 

² The term partial regression coefficient is used by some authors in place of net
regression coefficient.





Determining a regression equation for three independent variables. 
Solely to illustrate the method, we may take the number of men on 
each of these 20 farms as given in Table 47 and work out an estimating 
equation considering men as well as acres and cows. (In actual 
practice, 20 observations are usually too few to determine, with any 
degree of reliability, the net relation of one variable to 3 independent 
variables. This problem is used here solely to illustrate the process.) 

With the number of men designated as \(X_4\), the unknown constants
to be determined are those given in equation (36): \(a_{1.234}\), \(b_{12.34}\),
\(b_{13.24}\), and \(b_{14.23}\). They can be obtained by the solution of the
following set of equations.

\[
\begin{aligned}
\Sigma(x_2^2)\,b_{12.34} + \Sigma(x_2x_3)\,b_{13.24} + \Sigma(x_2x_4)\,b_{14.23} &= \Sigma(x_1x_2) \\
\Sigma(x_2x_3)\,b_{12.34} + \Sigma(x_3^2)\,b_{13.24} + \Sigma(x_3x_4)\,b_{14.23} &= \Sigma(x_1x_3) \\
\Sigma(x_2x_4)\,b_{12.34} + \Sigma(x_3x_4)\,b_{13.24} + \Sigma(x_4^2)\,b_{14.23} &= \Sigma(x_1x_4)
\end{aligned}
\tag{38}
\]

\[
a_{1.234} = M_1 - b_{12.34}M_2 - b_{13.24}M_3 - b_{14.23}M_4 \tag{39}
\]


Computing the extensions. All except 4 of the arithmetic values for
equation (38) which need to be calculated from the original data have
been worked out previously. Only the values which involve \(X_4\), and its
mean, are additional. The new values needed are therefore \(M_4\),
\(\Sigma(x_1x_4)\), \(\Sigma(x_2x_4)\), \(\Sigma(x_3x_4)\), and \(\Sigma(x_4^2)\). The computation of these values
is shown in Table 47.

All the calculations, including correcting for the means at the end,
are carried out just as in Table 45. The figures at the foot of each
column provide the remaining values necessary to write out equations
(38) in full. For convenience in writing these equations, we shall again
use the abridged notation of \(b_2\) for \(b_{12.34}\), \(b_3\) for \(b_{13.24}\), etc., remembering,
however, that \(b_2\) here is a different constant from \(b_2\) previously.


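\[
\begin{aligned}
\text{(I)}\quad & \Sigma(x_2^2)\,b_2 + \Sigma(x_2x_3)\,b_3 + \Sigma(x_2x_4)\,b_4 = \Sigma(x_1x_2): & 606.95\,b_2 - 394.15\,b_3 + 63.20\,b_4 &= 14.20 \\
\text{(II)}\quad & \Sigma(x_2x_3)\,b_2 + \Sigma(x_3^2)\,b_3 + \Sigma(x_3x_4)\,b_4 = \Sigma(x_1x_3): & -394.15\,b_2 + 676.55\,b_3 + 11.60\,b_4 &= 1360.60 \\
\text{(III)}\quad & \Sigma(x_2x_4)\,b_2 + \Sigma(x_3x_4)\,b_3 + \Sigma(x_4^2)\,b_4 = \Sigma(x_1x_4): & 63.20\,b_2 + 11.60\,b_3 + 17.20\,b_4 &= 193.20
\end{aligned}
\]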


Solving the equations. The three equations are now to be solved
simultaneously to determine the values for \(b_2\), \(b_3\), and \(b_4\). This can be
done by the usual algebraic processes, but the peculiar symmetrical




character of the equations, which the attentive reader has probably 
already noticed, makes it possible to use a much shorter method. Since 
the saving in clerical labor by the use of this method is quite significant, 
it will be shown in full. 

TABLE 47

Computation of Additional Values to Determine Multiple Regression
Equation, Adding a Third Independent Factor

                  Number     Number    Number    Number of
 Item             of acres,  of cows,  of men,   dollars
 number             X2*        X3        X4      income, X1*     X2X4      X3X4       X1X4      X4^2

   1                  6         18        2           96           12        36        192         4
   2                 22          0        3           83           66         0        249         9
   3                 18         14        4          126           72        56        504        16
   4                  8          6        1           61            8         6         61         1
   5                 12          1        1           59           12         1         59         1
   6                 10          9        1           90           10         9         90         1
   7                 17          6        3           82           51        18        246         9
   8                 11         12        2           88           22        24        176         4
   9                 16          7        2           86           32        14        172         4
  10                 23          2        3           76           69         6        228         9
  11                  7         17        2          102           14        34        204         4
  12                 12         15        3          108           36        45        324         9
  13                 24          7        4           96           96        28        384        16
  14                 16          0        2           70           32         0        140         4
  15                  9         12        1           80            9        12         80         1
  16                 11         16        3          113           33        48        339         9
  17                 22          2        2           76           44         4        152         4
  18                 11          6        1           74           11         6         74         1
  19                 16         12        2           98           32        24        196         4
  20                  8         15        2           80           16        30        160         4

 Sums               279        177       44        1,744          677       401      4,030    114.00
 Means            13.95       8.85      2.2         87.2
 Correction items                                               613.80    389.40   3,836.80     96.80
 Corrected sums                                                  63.20     11.60     193.20     17.20

* Coded by dividing by 10.


The first step is to set down the first equation (I) and divide it
through by the coefficient of the first term, with the sign changed,
or -606.95 in this case. The resulting derived equation (I') is set
down just below it:

\[
\begin{aligned}
\text{(I)}\quad & 606.95\,b_2 - 394.15\,b_3 + 63.20\,b_4 = 14.20 \\
\text{(I')}\quad & -b_2 + 0.64939\,b_3 - 0.10413\,b_4 = -0.02340
\end{aligned}
\]

The next step is to set down the second equation (II). The first 
equation (I) is then multiplied by the coefficient of the second term in 









the derived equation (I'), which is +0.64939 in this case, and the
products set down just below equation (II). These two equations are
added, giving the sum equation (Σ2), which cancels out the first term,
as shown below. The sum equation is then divided by the coefficient
of its first term, with the sign changed, giving the second derived
equation (II'). The second portion of the work now appears as follows:

\[
\begin{aligned}
\text{(II)}\quad & -394.15\,b_2 + 676.55\,b_3 + 11.60\,b_4 = 1360.60 \\
(0.64939)\,\text{(I)}\quad & 394.15\,b_2 - 255.96\,b_3 + 41.04\,b_4 = 9.22 \\
(\Sigma_2)\quad & 420.59\,b_3 + 52.64\,b_4 = 1369.82 \\
\text{(II')}\quad & -b_3 - 0.12516\,b_4 = -3.25690
\end{aligned}
\]

The final step in the process of elimination is to write down equation
(III), multiply the first equation (I) by the coefficient of the third term
of the first derived equation (I'), which is -0.10413 in this case, and
set the products down below equation (III); multiply the sum equation
(Σ2) by the corresponding coefficient (the second term) from the second
derived equation (II'), -0.12516; and set these products down below
the previous equation. Equation (III) and the two new equations are
then added, giving an equation (Σ3), from which values in both \(b_2\) and
\(b_3\) have been eliminated. This equation is then divided by the
coefficient of its first term, with the sign changed, -4.03 in this case,
and the resulting new derived equation entered as equation (III'). (A
method of checking each step in these computations is shown in
Appendix 1, Methods of Computation, page 464.) All the computations
to this point are:





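\[
\begin{array}{lrrrr}
\text{(I)} & 606.95\,b_2 & -\,394.15\,b_3 & +\,63.20\,b_4 & =\ 14.20 \\
\text{(I')} & -\,b_2 & +\,0.64939\,b_3 & -\,0.10413\,b_4 & =\ -0.02340 \\[4pt]
\text{(II)} & -\,394.15\,b_2 & +\,676.55\,b_3 & +\,11.60\,b_4 & =\ 1360.60 \\
(0.64939)\,\text{(I)} & 394.15\,b_2 & -\,255.96\,b_3 & +\,41.04\,b_4 & =\ 9.22 \\
(\Sigma_2) & & 420.59\,b_3 & +\,52.64\,b_4 & =\ 1369.82 \\
\text{(II')} & & -\,b_3 & -\,0.12516\,b_4 & =\ -3.25690 \\[4pt]
\text{(III)} & 63.20\,b_2 & +\,11.60\,b_3 & +\,17.20\,b_4 & =\ 193.20 \\
(-0.10413)\,\text{(I)} & -\,63.20\,b_2 & +\,41.04\,b_3 & -\,6.58\,b_4 & =\ -1.48 \\
(-0.12516)\,(\Sigma_2) & & -\,52.64\,b_3 & -\,6.59\,b_4 & =\ -171.45 \\
(\Sigma_3) & & & 4.03\,b_4 & =\ 20.27 \\
\text{(III')} & & & -\,b_4 & =\ -5.02978
\end{array}
\]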




It is now very easy to compute the values of \(b_2\), \(b_3\), and \(b_4\) from the
three derived equations. From equation (III'), \(b_4 = 5.02978\).

Substituting this value in equation (II'), which may be transposed
to read

\[
b_3 = 3.25690 - 0.12516\,b_4
\]

we find

\[
\begin{aligned}
b_3 &= 3.25690 - (0.12516)(5.02978) \\
    &= 3.25690 - 0.62953 = 2.62737
\end{aligned}
\]

Then, transposing equation (I'), we find

\[
b_2 = 0.02340 + 0.64939\,b_3 - 0.10413\,b_4,
\]

and substituting the values for \(b_3\) and \(b_4\),

\[
b_2 = 0.02340 + (1.70619) - (0.52375),
\]

we find

\[
b_2 = 1.20584
\]

The values of \(b_2\), \(b_3\), and \(b_4\), just computed, may next be verified by
substituting them in the last equation (III). Equations (I) or (II)
should not be used for this verification, since they will not provide a
complete check. Equation (III)

\[
63.20\,b_2 + 11.60\,b_3 + 17.20\,b_4 = 193.20
\]

becomes, when the newly calculated values are substituted,

\[
(63.20)(1.20584) + (11.60)(2.62737) + (17.20)(5.02978) = 193.20;
\]

this works out to

\[
76.21 + 30.48 + 86.51 = 193.20
\]

or

\[
193.20 = 193.20
\]

This proves the accuracy of all the previous work. 
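
The whole elimination can also be checked by machine. The sketch below is a minimal Python illustration using ordinary Gaussian elimination without pivoting followed by back substitution. It arrives at the same reduced equations as the tabular layout above but does not reproduce the symmetry shortcut of the Doolittle arrangement, and the function name `solve_by_elimination` is an assumption made for illustration. Because no intermediate rounding is applied, the results agree with the hand computation only to about three decimal places.

```python
def solve_by_elimination(coeffs, rhs):
    """Solve a small linear system by forward elimination and back substitution."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    rows = [list(coeffs[i]) + [rhs[i]] for i in range(n)]

    # Forward elimination: clear each unknown from the equations below it.
    for i in range(n):
        for j in range(i + 1, n):
            factor = rows[j][i] / rows[i][i]
            for k in range(i, n + 1):
                rows[j][k] -= factor * rows[i][k]

    # Back substitution: last unknown first, as with the derived equations.
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(rows[i][k] * solution[k] for k in range(i + 1, n))
        solution[i] = (rows[i][n] - known) / rows[i][i]
    return solution


# Equations (I), (II), (III) in coded units.
A = [[606.95, -394.15, 63.20],
     [-394.15, 676.55, 11.60],
     [63.20, 11.60, 17.20]]
R = [14.20, 1360.60, 193.20]

b2, b3, b4 = solve_by_elimination(A, R)
# Verification by substitution in equation (III), as in the text.
check = 63.20 * b2 + 11.60 * b3 + 17.20 * b4

print(f"b2 = {b2:.4f}, b3 = {b3:.4f}, b4 = {b4:.4f}, check = {check:.2f}")
```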

The work just summarized is all that is needed to solve these 
three simultaneous equations. In view of the way the terms cancel out 
during the second and subsequent steps of the process, the work can be
still further simplified by omitting all entries to the left of the solid 
line which has been drawn in through the last set of entries. 





Having calculated the values of the three \(b\)'s, we can calculate \(a\)
very readily.

\[
\begin{aligned}
a &= M_1 - b_2M_2 - b_3M_3 - b_4M_4 \\
  &= 87.2 - (1.20584)(13.95) - (2.62737)(8.85) - (5.02978)(2.20) \\
  &= 36.06
\end{aligned}
\]

The regression equation for the three variables is therefore

\[
\left(\frac{X_1}{10}\right) = 36.06 + 1.20584\left(\frac{X_2}{10}\right) + 2.62737\,X_3 + 5.02978\,X_4
\]

If we clear the fractions, the equation becomes

\[
X_1 = 360.60 + 1.20584\,X_2 + 26.2737\,X_3 + 50.2978\,X_4
\]
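
The step of clearing the fractions follows from the coding noted under Tables 45 and 47, in which income and acres were divided by 10. The Python sketch below computes the constant term from the means and then converts the coded equation back to original units. The function name `estimate_income` is an illustrative assumption, and the constant differs from 360.60 in the second decimal place because the text rounds \(a\) to 36.06 before multiplying by 10.

```python
# Net regression coefficients and means in coded units, from the text.
b2, b3, b4 = 1.20584, 2.62737, 5.02978
m1, m2, m3, m4 = 87.2, 13.95, 8.85, 2.2

# Equation (39): constant term in coded units.
a_coded = m1 - b2 * m2 - b3 * m3 - b4 * m4


def estimate_income(acres, cows, men):
    """Estimated income in dollars from acres, cows, and men in original units.

    Coded form: X1/10 = a + b2*(X2/10) + b3*X3 + b4*X4.
    Multiplying through by 10 leaves b2 unchanged and scales a, b3, and b4.
    """
    return 10 * a_coded + b2 * acres + 10 * b3 * cows + 10 * b4 * men


print(f"a (coded) = {a_coded:.2f}")
print(f"constant in dollars = {10 * a_coded:.2f}")
# A farm with average acres, cows, and men is estimated at the average income.
print(f"estimate at the means = {estimate_income(139.5, 8.85, 2.2):.1f}")
```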

Using this equation, we may work out values of \(X_1'\) and of \(z\) just as
we did previously. (This will be left as an exercise for the student.
Is \(\sigma_z\) for the new estimates larger or smaller than for the previous
estimates? Why should it be?)

Interpreting net regression coefficients. It should be noted that
though the value of 1.20584 for \(b_{12.34}\), just determined, compares with
the value of 2.13840 for \(b_{12.3}\), determined previously, they do not
measure exactly the same thing. The coefficient \(b_{12.34}\) shows the
average increase in income for each acre increase in size of farm, with
both the number of cows and the number of men remaining unchanged.
The coefficient \(b_{12.3}\) shows the average increase in income for each
increase of one acre in size, with the number of cows remaining un- 
changed, but without making any allowance for differences in the 
number of men. Apparently a considerable portion of the differences 
in income which on the earlier analysis would have been ascribed 
to the additional acreage is shown by this more complete analysis really 
to have been associated with the larger labor force on the greater 
acreages, rather than to the greater acreages themselves. This result 
illustrates one property of net regression coefficients in common with 
all other correlation results. They ascribe to any particular inde- 
pendent variable not only the variation in the dependent variable 
which is directly due to that independent variable but also the varia- 
tion which is due to such other independent variables correlated with 
it as have not been separately considered in the study. In the 
same way that acres, taken alone, included part of the effect due to 
cows, the effect of acres eliminating cows still included part of the 





effect due to men; and even the effect of acres holding constant the 
effect of both cows and men may still include variation due to other 
correlated variables, such, for example, as fertility of the land. These 
considerations illustrate the extreme care which is necessary in exami- 
nation of the data and the theoretical analysis of the problem before 
deciding on the variables to be correlated and the caution which must 
be employed in interpreting the results. 
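
The change in the acreage coefficient when men are brought into the analysis can be reproduced from the 20 farm records. The following Python sketch fits the equation first with acres and cows only, and then with acres, cows, and men, using the corrected sums of products and plain elimination. The data are transcribed from Tables 45 and 47 in coded units; the function names are illustrative assumptions, not the book's procedure.

```python
# Farm records in coded units (acres and income divided by 10), from Tables 45 and 47.
X2 = [6, 22, 18, 8, 12, 10, 17, 11, 16, 23, 7, 12, 24, 16, 9, 11, 22, 11, 16, 8]
X3 = [18, 0, 14, 6, 1, 9, 6, 12, 7, 2, 17, 15, 7, 0, 12, 16, 2, 6, 12, 15]
X4 = [2, 3, 4, 1, 1, 1, 3, 2, 2, 3, 2, 3, 4, 2, 1, 3, 2, 1, 2, 2]
X1 = [96, 83, 126, 61, 59, 90, 82, 88, 86, 76,
      102, 108, 96, 70, 80, 113, 76, 74, 98, 80]


def corrected_sum(u, v):
    """Sum of products of deviations from the means."""
    n = len(u)
    return sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n


def solve(coeffs, rhs):
    """Gaussian elimination without pivoting, then back substitution."""
    n = len(rhs)
    rows = [list(coeffs[i]) + [rhs[i]] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            factor = rows[j][i] / rows[i][i]
            for k in range(i, n + 1):
                rows[j][k] -= factor * rows[i][k]
    out = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(rows[i][k] * out[k] for k in range(i + 1, n))
        out[i] = (rows[i][n] - known) / rows[i][i]
    return out


def net_regression(y, xs):
    """Return the constant a and the list of net regression coefficients."""
    n = len(y)
    coeffs = [[corrected_sum(u, v) for v in xs] for u in xs]
    rhs = [corrected_sum(y, u) for u in xs]
    b = solve(coeffs, rhs)
    a = sum(y) / n - sum(bi * sum(x) / n for bi, x in zip(b, xs))
    return a, b


a_two, b_two = net_regression(X1, [X2, X3])          # cows held constant
a_three, b_three = net_regression(X1, [X2, X3, X4])  # cows and men held constant

print(f"b12.3  = {b_two[0]:.4f}")
print(f"b12.34 = {b_three[0]:.4f}")
```

The drop in the first coefficient between the two fits is the numerical counterpart of the statement above that part of the apparent effect of acres was associated with the larger labor force.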

Determining the regression equation for any number of inde- 
pendent variables. The same mathematical principle which has 
been used to determine the constants for regression equations involv- 
ing one, two, or three independent variables can be extended to 
problems involving any number of variables it may be desired to 
employ. 

For four independent variables the equations are: 


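\[
\begin{aligned}
\Sigma(x_2^2)\,b_{12.345} + \Sigma(x_2x_3)\,b_{13.245} + \Sigma(x_2x_4)\,b_{14.235} + \Sigma(x_2x_5)\,b_{15.234} &= \Sigma(x_1x_2) \\
\Sigma(x_2x_3)\,b_{12.345} + \Sigma(x_3^2)\,b_{13.245} + \Sigma(x_3x_4)\,b_{14.235} + \Sigma(x_3x_5)\,b_{15.234} &= \Sigma(x_1x_3) \\
\Sigma(x_2x_4)\,b_{12.345} + \Sigma(x_3x_4)\,b_{13.245} + \Sigma(x_4^2)\,b_{14.235} + \Sigma(x_4x_5)\,b_{15.234} &= \Sigma(x_1x_4) \\
\Sigma(x_2x_5)\,b_{12.345} + \Sigma(x_3x_5)\,b_{13.245} + \Sigma(x_4x_5)\,b_{14.235} + \Sigma(x_5^2)\,b_{15.234} &= \Sigma(x_1x_5)
\end{aligned}
\tag{40}
\]

\[
a_{1.2345} = M_1 - b_{12.345}M_2 - b_{13.245}M_3 - b_{14.235}M_4 - b_{15.234}M_5 \tag{41}
\]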

When this set of equations is compared with equation (38) for
three independent variables, it is evident that adding the additional
variable, \(X_5\), has made it necessary to add the additional equation, in
which \(x_5\) appears in each of the product terms, and also to add an
additional term to each of the previous equations, the additional term
including a product summation [such as \(\Sigma(x_2x_5)\) and \(\Sigma(x_3x_5)\)] in which
\(x_5\) appears, and also the net regression coefficient \(b_{15.234}\). The
equation to compute \(a\) has also been extended by adding the term
\(b_{15.234}M_5\). In the same way the equations to be solved to
determine the constants for any number of variables can be built up,
if it is remembered that for each variable added a new term must be
added to each of the previous equations and a new equation must be
added, each term added including the new variable in some way.

The products which must be computed for any given set of variables,





and the equations which will need to be solved, may be worked out 
readily by the use of the following scheme: 

Write out the required regression equation (in terms of deviations
from the mean), as, for example, for six variables:

\[
b_2x_2 + b_3x_3 + b_4x_4 + b_5x_5 + b_6x_6 = x_1
\]

Multiply each term through by the coefficient of the first unknown
(that is, by \(x_2\)) and sum. This gives the first of the required equations:

\[
\Sigma(x_2^2)\,b_2 + \Sigma(x_2x_3)\,b_3 + \Sigma(x_2x_4)\,b_4 + \Sigma(x_2x_5)\,b_5 + \Sigma(x_2x_6)\,b_6 = \Sigma(x_2x_1)
\]

Then multiply through by the coefficient of the second unknown (\(x_3\))
and sum. The second equation is, therefore,

\[
\Sigma(x_2x_3)\,b_2 + \Sigma(x_3^2)\,b_3 + \Sigma(x_3x_4)\,b_4 + \Sigma(x_3x_5)\,b_5 + \Sigma(x_3x_6)\,b_6 = \Sigma(x_3x_1)
\]

The same process is carried out for the coefficient of each unknown in
turn, giving five equations to be solved simultaneously to determine
the values for the five unknowns. Setting up these equations may be
reduced to a tabular form, as follows:


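TABLE 48

Form for Working Out the Equations to Derive Net Regression Constants

Independent                     Independent variables (in deviations from means)                             Dependent
variables         x2           x3           x4           x5           x6           x7           x8          variable, x1

    x2        Σ(x2^2)b2    Σ(x2x3)b3    Σ(x2x4)b4    Σ(x2x5)b5    Σ(x2x6)b6    Σ(x2x7)b7    Σ(x2x8)b8      = Σ(x1x2)
    x3        Σ(x2x3)b2    Σ(x3^2)b3    Σ(x3x4)b4    Σ(x3x5)b5    Σ(x3x6)b6    Σ(x3x7)b7    Σ(x3x8)b8      = Σ(x1x3)
    x4        Σ(x2x4)b2    Σ(x3x4)b3    Σ(x4^2)b4    Σ(x4x5)b5    Σ(x4x6)b6    Σ(x4x7)b7    Σ(x4x8)b8      = Σ(x1x4)
    x5        Σ(x2x5)b2    Σ(x3x5)b3    Σ(x4x5)b4    Σ(x5^2)b5    Σ(x5x6)b6    Σ(x5x7)b7    Σ(x5x8)b8      = Σ(x1x5)
    x6        Σ(x2x6)b2    Σ(x3x6)b3    Σ(x4x6)b4    Σ(x5x6)b5    Σ(x6^2)b6    Σ(x6x7)b7    Σ(x6x8)b8      = Σ(x1x6)
    x7        Σ(x2x7)b2    Σ(x3x7)b3    Σ(x4x7)b4    Σ(x5x7)b5    Σ(x6x7)b6    Σ(x7^2)b7    Σ(x7x8)b8      = Σ(x1x7)
    x8        Σ(x2x8)b2    Σ(x3x8)b3    Σ(x4x8)b4    Σ(x5x8)b5    Σ(x6x8)b6    Σ(x7x8)b7    Σ(x8^2)b8      = Σ(x1x8)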


The variables to be considered are listed at the head of columns
from the left to right, ending with the dependent variable at the right.
Then the independent variables are entered down the beginning of
the lines at the left in the same order. The cells of the table are then
filled by multiplying the variable at the head of the column by the
variable at the end of the line. These products indicate the values
to be computed (by equations [11] and [15]), to give the arithmetic
values for the equations. The \(b\) terms represent, of course, the
net regression coefficients for the particular number of variables
concerned; that is, \(b_2\) would be \(b_{12.3}\) for two independent variables,






\(b_{12.34}\) for three independent variables, and so on. The illustration is
carried out to seven independent variables, but the scheme can be 
extended to as many as it is desired to consider. 

The equation to compute a is simply the value of the mean of the 
dependent variable, minus the product of the mean of each inde- 
pendent variable multiplied by the coefficient for the net regression 
of the dependent variable on that independent variable. 
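
The tabular scheme lends itself to mechanical generation. The Python sketch below writes out the terms of each normal equation for any list of independent variables, following the rule that each cell is the product of the variable heading its column and the variable heading its line. The function name `normal_equation_terms` and the plain-text label format (with `S` standing for the summation sign) are illustrative assumptions; the labels are unambiguous only for single-digit variable numbers.

```python
def normal_equation_terms(independent, dependent=1):
    """Return, for each independent variable, the terms of its normal equation.

    Each entry is (list of left-hand terms, right-hand term), with 'S' standing
    for a sum of squares or products of deviations from the means.
    """
    equations = []
    for row in independent:
        terms = []
        for col in independent:
            i, j = sorted((row, col))
            label = f"S(x{i}^2)" if i == j else f"S(x{i}x{j})"
            # The coefficient attached to each cell belongs to the column variable.
            terms.append(f"{label}b{col}")
        rhs = f"S(x{dependent}x{row})"
        equations.append((terms, rhs))
    return equations


# Three independent variables, as in equations (38).
for terms, rhs in normal_equation_terms([2, 3, 4]):
    print(" + ".join(terms), "=", rhs)
```

Calling the function with `[2, 3, 4, 5, 6, 7, 8]` lays out the seven equations of the tabular form in the same way.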

As a matter of practical procedure, it is seldom that a problem 
is so complicated or that enough observations are available so that 
significant results for each variable will be obtained using ten or more 
variables; and, ordinarily, analyses involving not more than five
variables are all that will yield stable results. To illustrate some of the
details of the procedure necessary where a large number of variables 
must be considered, various methods to simplify the necessary calcu- 
lations in carrying through a problem involving a large number of 
observations are presented in Methods of Computation, Appendix 1. 

Interpreting the multiple regression equation. The same limita- 
tions apply in interpreting regression coefficients worked out with the 
effect of one or more variables held constant as when only two variables 
are considered. Thus for the data shown in Table 47: there were no 
observations with more than 18 cows, or 4 men, and none below 60 
acres or above 240 acres. For that reason, there is no basis for using 
the regression equation to estimate income beyond those limits. Fur- 
thermore, for the extreme ranges where only a few observations were 
available (for example, less than 80 acres) the relations could not be
expected to hold as well as where there were more observations upon 
which to base the conclusions. In Chapter 18 a more definite basis 
for determining the probable accuracy of such estimates is discussed. 
For the present the caution may be restated, that the results may be 
expected to hold true only within the range covered by the bulk of 
the observations upon which they were based.³

The meaning of the regression equation 

\[
X_1 = 360.60 + 1.21\,X_2 + 26.27\,X_3 + 50.30\,X_4
\]

may be made clearer, in publishing correlation results, by working out 
the estimated values for a representative variety of conditions. Such a 

³ Even within the limits of the range of observations there may be
combinations of values of independent variables which are not represented by the data,
either exactly or even approximately. Estimates for such combinations will have 
less reliability than for those combinations which are represented. For a fuller 
discussion of this source of unreliability, see Chapter 19. 





statement of the conclusions covered by the previous regression
equation would be as follows:


TABLE 49

Average Income on Farms With Varying Numbers of Acres, Cows, and Men
(As indicated by correlation analysis)

                        100 acres                         160 acres
Labor force     0 cows    8 cows   16 cows        0 cows    8 cows   16 cows
               Dollars   Dollars   Dollars       Dollars   Dollars   Dollars

1 man              532       742       952             *         *         *
2 men                *       792     1,003           655       865         *
3 men                *         *     1,053           706       916     1,125

* Omitted because of absence of observations representing this combination of factors.


It should be noted in Table 49 that, according to these results, 
increasing the number of men from 1 to 2, or from 2 to 3, will add 
$50 to income, no matter whether the farm has 100 acres and 8 
cows, or 160 acres and 16 cows. Similarly, adding 8 more cows is 
indicated as having the same effect on income, no matter how large 
the farm is or how many men are employed. But that this conclusion
has been reached is no proof that it is really true of the universe
represented by the original data. Instead, such a conclusion is in- 
herent in the linear equation (35, 36, or 37) which has been used. That 
equation necessarily assumes that an increase of one unit in any one 
independent variable will always be accompanied by an equal change 
in the dependent variable. Only insofar as the actual facts agree 
with that assumption can they be represented by a linear equation. 
Subsequent chapters (particularly 14 and 21) take up methods of 
analysis which may be employed when this type of relation is not true, 
and the linear equation is therefore unable to express the facts ade- 
quately. 
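
The constant increments that the linear equation imposes can be checked numerically. The following is a minimal sketch, assuming the rounded coefficients of the farm-income equation quoted earlier (360.60, 1.21, 26.27, and 50.30); the function name `estimated_income` is only illustrative and does not come from the text.

```python
def estimated_income(acres, cows, men):
    """Evaluate the linear farm-income regression equation (dollars)."""
    return 360.60 + 1.21 * acres + 26.27 * cows + 50.30 * men

# Reproduce a few cells of Table 49.
for acres, cows, men in [(100, 0, 1), (100, 8, 2), (160, 16, 3)]:
    print(acres, cows, men, round(estimated_income(acres, cows, men)))

# The gain from one more man is the same whatever the other values are,
# because the equation is linear in every variable.
gain_small = estimated_income(100, 8, 2) - estimated_income(100, 8, 1)
gain_large = estimated_income(160, 16, 3) - estimated_income(160, 16, 2)
print(round(gain_small, 2), round(gain_large, 2))
```

Both gains come out at the coefficient for men, whatever the acreage or number of cows, which is exactly the assumption the paragraph above warns about.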

Net regression coefficients, computed from a sample, may vary more 
or less widely from the true values for the universe from which that 
sample is drawn. Tests to indicate the reliability of such sample 
results are given in Chapter 18. They should always be calculated 
and considered before generalizing from such sample results. 

Summary. This chapter has presented mathematical methods 
for determining the constants of a linear regression equation, so that
changes in one variable may be estimated from changes in two or
more independent variables. Equations so determined afford a more 
exact basis for making such estimates than do linear equations 
obtained by any other method. Furthermore, the multiple regression 
equation serves to sum up all the evidence of a large number of 
observations in a single statement which expresses in condensed form 
the extent to which differences in the dependent variable tend to be 
associated with differences in each of the other variables, as shown by 
the sample. 



CHAPTER 13


MEASURING ACCURACY OF ESTIMATE AND DEGREE OF 
CORRELATION FOR LINEAR MULTIPLE CORRELATION 


Standard error of estimate. After working out equations by which 
values of one variable may be estimated from those for two or more 
independent variables, it is frequently desirable to have some measure 
of how closely such estimates agree with the actual values and of how 
closely the variation in the dependent variable is associated with the 
variation in the several independent variables. Attention has been 
called in the preceding chapters to the computation of the residuals, 
$z$, when the value of a variable is estimated from that of several others.
Where the estimate is based on several independent variables the standard
deviation of these residuals serves as a measure of the closeness
with which the original values may be estimated or reproduced just
as well as where the estimate is based on a single variable. Continuing
the same terminology as before, this standard deviation is still called
the "standard error of estimate." Thus for the regression equation
for estimating income from known numbers of acres, cows, and men,
the standard error of estimate is designated $S_{1.234}$. The subscripts
"1.234" indicate that that is the standard error for variable $X_1$ when
estimated from the independent variables $X_2$, $X_3$, and $X_4$.

Where the size of the sample is small in proportion to the number 
of variables involved, the standard deviation of the residuals for the 
cases included in the sample tends to have a downward bias. That 
is, it tends to be smaller than the standard error which would be ob- 
served if the same constant were computed from large samples drawn 
from the same universe. 

For that reason it is necessary to adjust the observed standard 
deviation of the residuals, before it will give an unbiased estimate 
of the value of the standard error of estimate in the universe. This 
adjustment is: 


$$\bar{S}^2_{1.234} = \frac{n\,\sigma^2_{z_{1.234}}}{n - m} \qquad (42)$$

where $n$ = number of sets of observations in the sample,

$m$ = number of constants in the regression equation, including
$a$ and the $b$'s.


(Where the adjusted value for $\bar{S}_{1.234}$ exceeds the value of $\sigma_1$, the
latter value should be used for the standard error.)

The standard errors for the equations obtained when one, two, and
three independent variables were considered in the farm-income study
in Chapter 12 may be summarized as follows:

Independent variables    Observed $\sigma_z$     n     m     Adjusted standard error
X2                            165.15*           20     2            165.15
X2, X3                         70.48            20     3             76.45
X2, X3, X4                     66.77            20     4             74.65

* This value has not been shown previously. It is calculated from the data of Chapter 12.


(In this case the correlation between $X_1$ and $X_2$ is practically zero, so
$\sigma_z = \sigma_1$. Under the rule given above, $\bar{S}_{1.2} = \sigma_1$.) The values tabulated
in the last column illustrate the increase in the reliability of estimate as
additional variables are taken into account.
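
The adjustment of equation (42), together with the rule that the adjusted value should not exceed the standard deviation of the dependent variable, can be sketched as follows. The function and argument names are illustrative assumptions; the figures are the observed values quoted above for the farm-income study.

```python
import math

def adjusted_standard_error(sigma_z, n, m, sigma_1=None):
    """Adjust the observed standard deviation of residuals, as in equation (42).

    sigma_z : observed standard deviation of the residuals
    n       : number of sets of observations
    m       : number of constants in the regression equation
    sigma_1 : standard deviation of the dependent variable (optional cap)
    """
    s_bar = sigma_z * math.sqrt(n / (n - m))
    # Rule from the text: never report more than sigma_1 itself.
    if sigma_1 is not None and s_bar > sigma_1:
        return sigma_1
    return s_bar

print(round(adjusted_standard_error(70.48, 20, 3), 2))
print(round(adjusted_standard_error(66.77, 20, 4), 2))
print(adjusted_standard_error(165.15, 20, 2, sigma_1=165.15))
```

With only 20 observations the adjustment is far from negligible: it raises the two-variable figure by about 8 per cent and the three-variable figure by about 12 per cent.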


So far, the standard errors of estimate (except for simple or two-variable
correlation) have been determined by actually working out all
the estimated values, subtracting to get the individual residuals, $z$, and
then determining their standard deviation. For linear multiple regression
equations, however, a much simpler process can be used. To compute
the standard deviation of the residuals by this process, all that is
required in addition to the values which have been used in computing
the $b$'s is the value $\Sigma(x_1^2)$. The formula is as follows:


$$\bar{S}^2_{1.234 \ldots n} = \frac{\Sigma(x_1^2) - \left[b_{12.34 \ldots n}(\Sigma x_1 x_2) + b_{13.24 \ldots n}(\Sigma x_1 x_3) + \cdots + b_{1n.23 \ldots (n-1)}(\Sigma x_1 x_n)\right]}{n - m} \qquad (43)$$


Substituting the values for the regression equation computed with two
independent variables, pages 193 and 194, the equation becomes

$$\bar{S}^2_{1.23} = \frac{\Sigma(x_1^2) - \left[b_{12.3}(\Sigma x_1 x_2) + b_{13.2}(\Sigma x_1 x_3)\right]}{n - 3}$$

In terms of coded values for $X_1$,

$$\frac{\bar{S}^2_{1.23}}{10^2} = \frac{5{,}455.20 - (2.1384)(14.20) - (3.2569)(1{,}360.60)}{20 - 3} = \frac{993.50}{17}$$

$$\frac{\bar{S}_{1.23}}{10} = \sqrt{\frac{993.50}{17}} = 7.645; \qquad \bar{S}_{1.23} = 76.45$$




210 MEASURES OF MULTIPLE AND PARTIAL LINEAR CORRELATION


The result is seen to be identical with the value computed (after 
adjustment) by the lengthy process illustrated in Table 46, on page 196, 
of working out all the individual estimates, computing their standard 
deviation, and then adjusting by equation (42) . 
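
The short method of equation (43) can be sketched in a few lines. The function name and argument layout are assumptions made for illustration; the sums and coefficients are the coded values quoted in the text, with income in units of $10.

```python
import math

def short_formula_standard_error(sum_x1_sq, b_coeffs, sum_x1_xj, n):
    """Adjusted standard error of estimate by the short method of equation (43).

    sum_x1_sq : sum of squared deviations of the dependent variable
    b_coeffs  : net regression coefficients, one per independent variable
    sum_x1_xj : sums of products of deviations, x1 with each independent variable
    n         : number of observations
    """
    m = len(b_coeffs) + 1  # the constant a plus one b per independent variable
    explained = sum(b * s for b, s in zip(b_coeffs, sum_x1_xj))
    return math.sqrt((sum_x1_sq - explained) / (n - m))

# Coded values quoted in the text (income in units of $10).
coded = short_formula_standard_error(5455.20, [2.1384, 3.2569], [14.20, 1360.60], 20)
print(round(coded, 3), round(coded * 10, 2))
```

No individual estimates or residuals are needed; everything entering the calculation is already at hand once the regression coefficients have been computed.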

Multiple correlation. The standard error of estimate for a mul- 
tiple regression equation, just as with simple correlation, measures the 
closeness with which the estimated values agree with the original 
values. The standard error, however, offers no measure of the proportion
of the variation in the dependent factor which can be explained
by, or is associated with, variation in the independent factor
or factors. For example, in one area the farm income might be twice 
as variable as in another. If two or three independent factors such 
as those discussed came as near accounting for all the variation in 
incomes in one area as in the other, the standard errors of estimate 
would be the same in both cases. There was originally more vari- 
ance in income in the one case than in the other; therefore with the 
same amount left unaccounted for the independent factors would 
have been associated with a larger proportion of the original variance, 
in the case where it was largest to begin with, and would have been 
relatively more important in that case. In simple correlation, the 
relative importance of the independent factor was measured by the 
ratio of the standard deviation of the estimated values to the stand- 
ard deviation of the actual values, and the name coefficient of correla- 
tion was given to this ratio. In exactly similar manner, when the 
estimates are based on several variables, instead of on one, the rela- 
tive importance of all those variables combined may be measured 
by dividing the standard deviation of the estimated values by that 
of the original values. This ratio is named the coefficient of multiple
correlation, since it measures the combined importance of the several
independent factors as a means of explaining the differences in the 
dependent factor. 

If we use $X'_{1(234)}$ to designate the estimates of $X_1$ made from variables
$X_2$, $X_3$, and $X_4$, and use $R_{1.234}$ to represent the unadjusted coefficient
of multiple correlation, the coefficient may be defined:

$$X'_{1(234)} = a_{1.234} + b_{12.34}X_2 + b_{13.24}X_3 + b_{14.23}X_4 \qquad (44)$$

$$R_{1.234} = \frac{\sigma_{1(234)}}{\sigma_1} \qquad (45)$$


The same short formula which has been shown for computing the
standard error of estimate may be employed to facilitate the computation
of the coefficient of multiple correlation, using only values already
involved in equation (43). The equation for computing the coefficient
of correlation by this method is:¹

$$R^2_{1.234 \ldots n} = \frac{b_{12.34 \ldots n}(\Sigma x_1 x_2) + b_{13.24 \ldots n}(\Sigma x_1 x_3) + \cdots + b_{1n.23 \ldots (n-1)}(\Sigma x_1 x_n)}{\Sigma(x_1^2)} \qquad (46)$$


There is a tendency for the multiple correlation shown by the sample 
to be in excess of the correlation existing in the universe from which the 
sample was drawn, especially where the number of observations is small, 
or the number of variables large. For that reason the coefficient
$R_{1.23 \ldots n}$, computed as shown in equation (46), has to be adjusted
before it will give $\bar{R}_{1.23 \ldots n}$, the unbiased estimate of the correlation
most probably existing in the whole universe. The adjustment is:

$$\bar{R}^2_{1.234 \ldots n} = 1 - \left(1 - R^2_{1.234 \ldots n}\right)\left(\frac{n - 1}{n - m}\right) \qquad (47)$$

$m$ and $n$ have the same meaning for this equation as in equation (42).

If the value for $\bar{R}^2$ comes out a minus quantity, use 0 for $\bar{R}$.

The square of the coefficient of multiple correlation, $R^2$, may be
termed the coefficient of multiple determination.
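
Equations (46) and (47) can be combined in a short calculation. This is a sketch only: the function name is illustrative, and the sums are the coded farm-income values quoted earlier for the equation with two independent variables.

```python
import math

def adjusted_r_squared(r_squared, n, m):
    """Equation (47), with negative results replaced by zero as the text directs."""
    adjusted = 1 - (1 - r_squared) * (n - 1) / (n - m)
    return max(adjusted, 0.0)

# Unadjusted R^2 from equation (46), using the coded sums quoted earlier.
explained = 2.1384 * 14.20 + 3.2569 * 1360.60
r_sq = explained / 5455.20
r_bar_sq = adjusted_r_squared(r_sq, n=20, m=3)
print(round(r_sq, 3), round(r_bar_sq, 3), round(math.sqrt(r_bar_sq), 3))
```

The adjusted coefficient is always below the unadjusted one when a constant is fitted, and the gap widens as the number of constants approaches the number of observations.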

The same relations hold between the coefficient of multiple correla- 
tion and the standard error of estimate in the case of multiple correla- 
tion as in the case of simple correlation. For that reason, one of these 
measures may be computed from the other, whichever is determined 
first, according to the following equations : 

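$$\bar{R}^2_{1.234 \ldots n} = 1 - \frac{\bar{S}^2_{1.234 \ldots n}}{\sigma_1^2}\left(\frac{n - 1}{n}\right) \qquad (48)$$

$$\bar{S}^2_{1.234 \ldots n} = \sigma_1^2\left(1 - \bar{R}^2_{1.234 \ldots n}\right)\left(\frac{n}{n - 1}\right) \qquad (49)$$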

Using equation (48) to compute the values of $\bar{R}$ from the values of $\bar{S}$
previously computed, the multiple coefficients for the three regression
equations previously worked out may be stated in the following different
ways:

Dependent      Independent variable(s)               $\bar{S}$        $\bar{R}$         $\bar{R}^2$
variable                                             Standard     Coefficient    Coefficient
                                                     error of     of multiple    of multiple
                                                     estimate     correlation    determination
X1 (income)    X2 (acres)                            165.15       0*             0
X1 (income)    X2 (acres); X3 (cows)                  76.45       0.892          0.796
X1 (income)    X2 (acres); X3 (cows); X4 (men)        74.65       0.898          0.806

* The value shown here should be that of $\bar{r}_{12}$. In this case it happens to be zero.

1 This may be computed most conveniently by following the form shown on
pages 467 and 469.
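
The passage between the adjusted standard error and the adjusted coefficient can be sketched as below. The relation used, $\bar{R}^2 = 1 - (\bar{S}^2/\sigma_1^2)(n-1)/n$, is what follows from combining equations (42) and (47) when $\sigma_1$ is computed with divisor $n$; that reading, and the function names, are assumptions of this sketch, though they do reproduce the tabulated figures.

```python
import math

def r_bar_squared_from_s_bar(s_bar, sigma_1, n):
    """Adjusted coefficient of determination from the adjusted standard error."""
    return 1 - (s_bar ** 2 / sigma_1 ** 2) * (n - 1) / n

def s_bar_from_r_bar_squared(r_bar_sq, sigma_1, n):
    """Inverse relation: adjusted standard error from adjusted R-squared."""
    return sigma_1 * math.sqrt((1 - r_bar_sq) * n / (n - 1))

for s_bar in (76.45, 74.65):
    r2 = r_bar_squared_from_s_bar(s_bar, sigma_1=165.15, n=20)
    print(s_bar, round(r2, 3), round(math.sqrt(r2), 3))
```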


It is evident that the correlation increases as the standard error 
decreases. Here the residual variation in each case is being compared 
with the same original standard deviation, so that that necessarily fol- 
lows. Where different studies are being compared, however, such as 
two samples with widely different original deviations in the dependent 
variable, the standard error of estimate would not necessarily decrease 
as the correlation increased, since the former is an absolute measure
whereas the latter is a relative measure.²

It is evident from the figures just shown that the coefficient of
multiple correlation, if incorrectly interpreted, makes the relationship
seem closer than does the coefficient of multiple determination.
It cannot be demonstrated that the coefficient of multiple determination
will measure in all cases that proportion of the variance in the
dependent factor which is associated with the independent factors.
Yet it is sufficiently true so that, if such a statement is to be made as
"seventy-five per cent of the variance in income was associated with
(or related to) variances in numbers of acres farmed, or cows milked,
and men hired," it is more accurate to use the coefficient of multiple
determination than to use the coefficient of multiple correlation. The
latter would overstate the case. This principle holds true both for
simple correlation ($r$) and multiple correlation ($R$): the square of the
coefficient indicates the proportion of the variance in the dependent
variable which has been mathematically accounted for; whereas one
minus the square of the coefficient indicates the proportion which has
not been accounted for.³

2 This point is of considerable significance in certain types of economic prob- 
lems, particularly in time-series analysis. For example, taking the first differences 
of a series of values frequently tends to make the deviations much larger than by 
taking deviations from trend. A study which gives a higher coefficient of correlation
for first differences than for deviations from trend may still yield the less
accurate estimate, as measured by the standard error of estimate. 

3 See Note 7, Appendix 2. 







MEASURING SEPARATE EFFECT OF INDIVIDUAL VARIABLES 213 


The coefficient of multiple correlation, $\bar{R}_{1.234 \ldots n}$, may also be defined
as the simple correlation between the actual $X_1$ values and the
$X'_{1(234)}$ values estimated from the several independent factors. This
interpretation illustrates the way it sums up the combined relation of 
the dependent variable to the several independent variables. 
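
That the multiple correlation is simply the correlation between actual and estimated values can be verified on a small example. The data below are hypothetical, invented purely for illustration, and the coefficient computed is the unadjusted one; the two net regression coefficients are obtained from the normal equations by Cramer's rule.

```python
import math

def deviations(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def sum_products(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical data, invented purely for illustration.
x1 = [12.0, 15.0, 11.0, 19.0, 17.0, 22.0]   # dependent variable
x2 = [2.0, 3.0, 1.0, 5.0, 4.0, 6.0]
x3 = [7.0, 5.0, 8.0, 4.0, 6.0, 3.0]

d1, d2, d3 = deviations(x1), deviations(x2), deviations(x3)
s22, s33, s23 = sum_products(d2, d2), sum_products(d3, d3), sum_products(d2, d3)
s12, s13, s11 = sum_products(d1, d2), sum_products(d1, d3), sum_products(d1, d1)

# Solve the two normal equations for b12.3 and b13.2 by Cramer's rule.
det = s22 * s33 - s23 * s23
b12_3 = (s12 * s33 - s13 * s23) / det
b13_2 = (s13 * s22 - s12 * s23) / det

# R from the short formula (46): explained sum of squares over the total.
r_short = math.sqrt((b12_3 * s12 + b13_2 * s13) / s11)

# R as the simple correlation between actual and estimated values.
est = [b12_3 * a + b13_2 * b for a, b in zip(d2, d3)]  # estimates, as deviations
r_simple = sum_products(d1, est) / math.sqrt(s11 * sum_products(est, est))

print(round(r_short, 6), round(r_simple, 6))
```

The two routes agree to within rounding error, since the estimates are uncorrelated with the residuals in a least-squares fit.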

(For the most convenient methods of calculating the various meas- 
ures discussed in this chapter, see Appendix 1, pages 459 to 478.) 

Measuring the separate effect of individual variables. In addition 
to the measures of the importance of all of the independent variables 
combined, it is sometimes desirable to have measures of the importance 
of each of the individual variables taken separately, while simul- 
taneously allowing for the variation associated with remaining inde- 
pendent variables. There are two different types of these measures: 
the coefficient of partial correlation and the "beta" coefficients.⁴

Partial correlation. Coefficients of partial correlation serve to determine
the correlation between the dependent factor and each of
the several independent factors, while eliminating any (linear) tend- 
ency of the remaining independent factors to obscure the relation. 
Thus in the problem where income was correlated with numbers of 
acres, cows, and men, the partial correlation of income with acres, 
while holding constant cows and men, indicates what the average 
correlation would probably be between acres and income in samples 
of farms in which all the farms in each sample had the same number 
of cows and the same number of men. 

If the data we have just been discussing were classified into groups 
which had the same number of cows and men in each group, and the 
correlation of the income and acres for the farms in each group was 
calculated separately, that would give a series of values for the corre- 
lation between acres and income for series of groups in each of which 
there was no variation in cows or men. If a weighted average of this
series of correlations was then calculated,⁵ it would correspond to the
partial correlation of income with acres, while holding cows and men
constant ($\bar{r}_{12.34}$). A similar interpretation can be made for the other
two partial correlation coefficients. Even in problems (such as the
present one) where the number of observations is not sufficient to permit
of many such subgroups being formed, the partial correlation
coefficient indicates about what such an average correlation in selected
subgroups would be, if computed from a larger sample drawn from
the same universe.

4 Discussion of the coefficient of part correlation (which was covered on pages
182 and 183 of the first edition of this book) has been dropped from this edition.
It is defined by the formula

$${}_{12}\bar{r}^2_{34} = \frac{b^2_{12.34}\,\sigma_2^2}{b^2_{12.34}\,\sigma_2^2 + \sigma_1^2\left(1 - \bar{R}^2_{1.234}\right)} \qquad (51)$$

Little practical use has been found for this coefficient, except that it does provide
a maximum value for the coefficient of partial correlation. Although its formal
interpretation was correct as given previously, it seems to provide insufficient
information to justify its detailed presentation. However, its derivation is still given
in Note 9, Appendix 2, as before.
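
The weighted averaging of subgroup correlations mentioned above is usually carried out through Fisher's z-transformation. The sketch below assumes the conventional weights of (n - 3) for each subgroup; neither that weighting nor the subgroup figures are taken from the text, and the function name is illustrative.

```python
import math

def average_correlation(correlations, sizes):
    """Weighted average of correlations through Fisher's z-transformation.

    Each r is converted to z = atanh(r), the z values are averaged with
    weights (n - 3), and the mean z is converted back with tanh.
    """
    weights = [n - 3 for n in sizes]
    z_mean = sum(w * math.atanh(r) for w, r in zip(weights, correlations)) / sum(weights)
    return math.tanh(z_mean)

# Hypothetical subgroup correlations between acres and income.
print(round(average_correlation([0.20, 0.35, 0.25], [12, 20, 15]), 3))
```

Averaging in the z scale rather than averaging the coefficients directly avoids the distortion that arises because the sampling distribution of r is skewed.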

Any group of independent variables may serve to explain some, but 
not all, of the variation in a dependent variable. If an additional 
independent variable is added, it may account for part of the variation 
left unexplained by the factors previously considered. The coefficient 
of partial correlation may be defined as a measure of the extent to 
which that part of the variation in the dependent variable which was 
not explained by the other independent factors can be explained by 
the addition of the new factor. For example, in the farm-income
problem, considering only acres and cows, the correlation was $\bar{R}_{1.23} = 0.892$.
When acres, cows, and men were considered, the correlation was
$\bar{R}_{1.234} = 0.898$. Squaring both values shows that, whereas the two
variables explain 79.6 per cent of the variance in income, the three
variables explain 80.6 per cent. Whereas 20.4 per cent of the variance
is left to be explained when the two variables are considered, only
19.4 per cent is left to be explained when three are considered. Adding
the additional variable has increased the variance which can be explained
by the difference between these two figures, or 1.0 per cent
(20.4 - 19.4 per cent). If the importance of this increase is determined
by comparing it to the variance left unexplained before the new
variable was added, we find that 1.0/20.4, or 4.90 per cent of the variance
left unexplained by acres and cows, has now been found to have been
associated with differences in numbers of men. Taking its square root
gives the coefficient of partial correlation, 0.221.

The coefficient is designated $\bar{r}_{14.23}$, since it shows the partial correlation
between $X_1$ and $X_4$, after $X_2$ and $X_3$ had been taken into account.
As is indicated in the discussion, it may be computed by the formula⁶

$$\bar{r}^2_{14.23} = \frac{\left(1 - \bar{R}^2_{1.23}\right) - \left(1 - \bar{R}^2_{1.234}\right)}{1 - \bar{R}^2_{1.23}}$$

5 The calculation of the average of a series of correlation coefficients would involve
the use of Fisher's z-transformation.

6 This is different from the formula customarily given. See Note 7, Appendix 2,
for its derivation.




For purposes of computation, this formula may be simplified to

$$\bar{r}^2_{14.23} = 1 - \frac{1 - \bar{R}^2_{1.234}}{1 - \bar{R}^2_{1.23}} \qquad (50)$$


If it is desired to compute coefficients of partial correlation for the
other independent variables, acres and cows, the corresponding formulas
are⁷

$$\bar{r}^2_{13.24} = 1 - \frac{1 - \bar{R}^2_{1.234}}{1 - \bar{R}^2_{1.24}}$$

$$\bar{r}^2_{12.34} = 1 - \frac{1 - \bar{R}^2_{1.234}}{1 - \bar{R}^2_{1.34}}$$

It should be noticed that, although the numerator of the fraction is 
the same in each case, the denominator is different. This is a pecu- 
liarity of coefficients of partial correlation — they measure the im- 
portance of each of the several variables by determining how much it 
reduces the variation after all the other variables except it are taken 
into account. 

If we work out the new multiple correlations necessary,⁸ $\bar{R}_{1.24}$ and
$\bar{R}_{1.34}$, and substitute them in the equations given just above, the whole
set of coefficients of partial correlation and partial determination for the
farm-income problem works out as follows:

$$\bar{r}^2_{13.24} = 1 - \frac{1 - 0.806}{1 - 0.590} = 0.527$$

$$\bar{r}^2_{12.34} = 1 - \frac{1 - 0.806}{1 - 0.791} = 0.072$$
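
The three partial coefficients can be reproduced from the adjusted coefficients of determination with a one-line function. The function name is illustrative; the figures 0.806, 0.796, 0.791, and 0.590 are the values quoted in the text.

```python
import math

def partial_r_squared(r_sq_all, r_sq_without):
    """Equation (50): share of the previously unexplained variance that the
    added variable accounts for.

    r_sq_all     : adjusted R-squared with every independent variable included
    r_sq_without : adjusted R-squared with the variable of interest left out
    """
    return 1 - (1 - r_sq_all) / (1 - r_sq_without)

# Adjusted R-squared values quoted in the text for the farm-income problem.
for label, without in [("acres", 0.791), ("cows", 0.590), ("men", 0.796)]:
    r2 = partial_r_squared(0.806, without)
    print(label, round(r2, 3), round(math.sqrt(r2), 2))
```

The numerator is the same in every case; only the denominator changes, according to which variable is treated as the one added last.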


7 Equation (50) and these following equations will give values for the partial
correlation coefficients, which will differ slightly from those computed by the classical
equations used by Yule, and then adjusted by equation (47). In view of the
definition of the adjusted partial correlation coefficient just given, however, it is
believed that this method of computation directly from the adjusted values, $\bar{R}_{1.234}$
and $\bar{R}_{1.23}$, is sufficiently accurate for all practical purposes.

8 The two new coefficients of multiple correlation are obtained by rearranging
the arithmetic values previously computed so as to give the necessary regression
coefficients, and then determining the value of $\bar{R}$ by equations (46) and (47). The
two new sets of equations are:

To determine $\bar{R}_{1.24}$:

$$(\Sigma x_2^2)\,b_{12.4} + (\Sigma x_2 x_4)\,b_{14.2} = (\Sigma x_1 x_2)$$

$$(\Sigma x_2 x_4)\,b_{12.4} + (\Sigma x_4^2)\,b_{14.2} = (\Sigma x_1 x_4)$$

Similarly for $\bar{R}_{1.34}$:

$$(\Sigma x_3^2)\,b_{13.4} + (\Sigma x_3 x_4)\,b_{14.3} = (\Sigma x_1 x_3)$$

$$(\Sigma x_3 x_4)\,b_{13.4} + (\Sigma x_4^2)\,b_{14.3} = (\Sigma x_1 x_4)$$





Relative Importance of Individual Factors Affecting Income, as Indicated
by Coefficients of Partial Correlation

Factors already           Factor        Coefficient of partial     Reduction in unexplained
considered                added         correlation                variance
                                        ($\bar{r}_{12.34}$, etc.)        ($\bar{r}^2_{12.34}$, etc.)
Cows (X3), men (X4)       Acres (X2)           0.27                      0.072
Acres (X2), men (X4)      Cows (X3)            0.73                      0.527
Acres (X2), cows (X3)     Men (X4)             0.22                      0.049



When income was correlated with acres alone, there was no correlation
at all. (Before adjusting for the number of observations,
$r_{12} = 0.01$.) Yet the partial correlation of income with acres, while
holding constant the variation associated with cows and men, has just 
been seen to be 0.27. Although this is not high, it is certainly more 
than no correlation at all. Furthermore, even though the correlation 
of income with cows alone is 0.64, the correlation with both acres and 
cows is 0.89. 

On the surface of the data there appears to be no relation between 
acres and income, since the positive relation of acres to income is 
hidden. Acres are negatively correlated with cows to a sufficient 
extent so that the decreased income with decreased number of cows 
offsets the increases with more acres. Only when the number of cows 
is allowed for can the influence of acres be seen. 

It is evident that a mere surface examination of a set of data 
cannot reveal which independent factors are important and which are 
unimportant. A variable which shows no correlation with the dependent
variable may yet show significant correlation after the relation
to other variables has been allowed for. 
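
How a relation can be completely hidden in the simple correlation is easiest to see in an extreme case. The data below are hypothetical and deliberately artificial: the dependent variable is exactly the sum of two independent variables that are negatively correlated with each other. The partial correlation is computed by the classical first-order formula rather than by the adjusted method of this chapter.

```python
import math

def corr(u, v):
    """Simple (Pearson) correlation coefficient."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du, dv = [a - mu for a in u], [b - mv for b in v]
    suv = sum(a * b for a, b in zip(du, dv))
    return suv / math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))

# Hypothetical, deliberately extreme data: x1 = x2 + x3 exactly, with x2 and x3
# negatively correlated so that the effect of x2 is hidden.
x2 = [-2, -1, 0, 1, 2]
x3 = [3, -1, 2, -3, -1]
x1 = [a + b for a, b in zip(x2, x3)]

r12, r13, r23 = corr(x1, x2), corr(x1, x3), corr(x2, x3)
# Classical first-order partial correlation of x1 with x2, holding x3 constant.
r12_3 = (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
print(round(r12, 3), round(r23, 3), round(r12_3, 3))
```

Here the simple correlation of the dependent variable with the first factor is zero, yet once the second factor is held constant the relation is perfect. Real data are never so clean, but the mechanism is the one at work between acres and cows.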

Investigators sometimes think they are doing "research" when they
study the relation of a given variable, say the price of a commodity, 
to a number of other factors, discard all those factors that show no 
correlation with price, and select out for further study by multiple 
correlation the factors that show the highest simple correlation with 
the price. As the preceding discussion shows, that procedure may 
result in discarding factors which would show a truly important rela- 
tion to price after the effect of other associated factors had been 
allowed for. A careful, logical examination of the problem, the selection
of the factors to be considered on the basis of these qualitative
considerations, and then preliminary examination of all the intercorrelations
among the selected independent factors will provide more
trustworthy results. (See Chapter 24 for a more detailed discussion
of the places of qualitative and quantitative analysis in such studies.)

The test whether a given independent variable may really be 
related to the dependent variable, even if it shows no apparent corre- 
lation, is whether that independent variable is correlated with other 
independent variables, which in turn are correlated with the dependent. 
Thus in the example just discussed, although acres showed no correla- 
tion with income, they did show significant correlation with cows. 
If acres had had no correlation with either income, cows, or men, it 
would have been impossible for acres to have correlation with income 
even after the relation to cows and men was allowed for. 

"Beta" coefficients. The importance of individual variables may
also be compared by their net regression coefficients. The size of the
regression coefficients, however, varies with the units in which each
variable is stated. They may be made more comparable by expressing
each variable in terms of its own standard deviation, using the "beta"
coefficients mentioned in Chapter 9. In terms of betas, the regression
equation for four variables would be

$$\frac{X_1}{\sigma_1} = \beta_{12.34}\frac{X_2}{\sigma_2} + \beta_{13.24}\frac{X_3}{\sigma_3} + \beta_{14.23}\frac{X_4}{\sigma_4} + a$$

Hence the partial betas may be defined

$$\beta_{12.34} = b_{12.34}\frac{\sigma_2}{\sigma_1} \qquad (52)$$


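For the problem we have been considering, the betas may be calculated
very readily:

$$\beta_{12.34} = b_{12.34}\frac{\sigma_2}{\sigma_1} = 1.2058\left(\frac{\sigma_2}{16.52}\right) = 0.402$$

$$\beta_{13.24} = b_{13.24}\frac{\sigma_3}{\sigma_1} = 2.6274\left(\frac{\sigma_3}{16.52}\right) = 0.926$$

$$\beta_{14.23} = b_{14.23}\frac{\sigma_4}{\sigma_1} = 5.0298\left(\frac{\sigma_4}{16.52}\right) = 0.282$$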
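
The conversion of equation (52) is a single multiplication, as the sketch below shows. The data are hypothetical and involve one independent variable only, chosen because in that case the beta coefficient must equal the simple correlation coefficient, which gives a convenient check; the function names are illustrative.

```python
import math

def std_dev(values):
    """Standard deviation with divisor n, matching the sigma of the text."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def beta_coefficient(b, sigma_independent, sigma_dependent):
    """Equation (52): regression coefficient restated in standard-deviation units."""
    return b * sigma_independent / sigma_dependent

# Hypothetical one-variable example; here beta must equal the simple correlation.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
mx, my = sum(x) / len(x), sum(y) / len(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
b = sxy / sxx
beta = beta_coefficient(b, std_dev(x), std_dev(y))
r = sxy / (len(x) * std_dev(x) * std_dev(y))
print(round(b, 3), round(beta, 4), round(r, 4))
```

With several independent variables the betas no longer equal any correlation coefficient, but they remain free of the units in which the variables happen to be stated.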


If the relative importance of each of the different factors, as 
judged by the two different types of individual measurement, is 
compared, the relations are: 





Relative Importance of Individual Factors Affecting Income, as Indicated
by Two Different Coefficients

Independent     Factors held constant       Coefficients of partial     Beta coefficients
factor                                      correlation ($\bar{r}_{12.34}$)      ($\beta_{12.34}$)
Acres (X2)      Cows (X3), men (X4)                 0.27                     0.402
Cows (X3)       Acres (X2), men (X4)                0.80                     0.926
Men (X4)        Acres (X2), cows (X3)               0.22                     0.282


It is evident from this comparison that, although the exact values 
differ for the two sets of measures, the rank of the three variables in 
order of importance is the same and the relative sizes are comparable.⁹
This does not always hold true, owing to the mathematical differences 
in the meaning of the two sets. 

Besides the coefficients which have been discussed, which measure 
either the total relative importance of all the independent variables or 
the importance of each one separately, it is sometimes desirable 
to measure the correlation between one variable and a group of others, 
after eliminating from the dependent variable that part of its variation 
imputed (by the analysis) to a single one of the independent variables. 
The problem may be stated as follows: 

Where $\bar{R}_{1.234}$ measures the relation between $X_1$ and $X_2$, $X_3$, $X_4$,
according to the regression equation (36), the problem stated is to
determine the correlation between $(X_1 - b_{12.34}X_2)$ and the two remaining
independent variables, according to the equation

$$(X_1 - b_{12.34}X_2) = a_{1.234} + b_{13.24}X_3 + b_{14.23}X_4$$

This could be determined by actually carrying out the operations
indicated, but it can be much more readily computed by use of the
formula¹⁰

$$\text{Multiple correlation squared of } (X_1 - b_{12.34}X_2) \text{ with } X_3 \text{ and } X_4 = 1 - \frac{\sigma_1^2\left(1 - \bar{R}^2_{1.234}\right)}{\sigma_1^2 - 2b_{12.34}(\Sigma x_1 x_2 / n) + b^2_{12.34}\,\sigma_2^2} \qquad (53)$$

9 One other type of measure of individual importance, the coefficient of separate
determination, is discussed in Note 11, Appendix 2.

10 See Note 12, Appendix 2, for derivation of this equation.
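
The agreement between formula (53) and the direct calculation can be checked on a small case. The sketch below uses hypothetical data and, to keep the arithmetic short, only two independent variables, so that the correlation sought is between $(X_1 - b_{12.3}X_2)$ and $X_3$ alone. It also uses the unadjusted coefficient of determination, an assumption under which the two routes agree exactly.

```python
def deviations(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def sp(u, v):
    """Sum of products of two series of deviations."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical data, invented for illustration only.
x1 = [12.0, 15.0, 11.0, 19.0, 17.0, 22.0]
x2 = [2.0, 3.0, 1.0, 5.0, 4.0, 6.0]
x3 = [7.0, 5.0, 8.0, 4.0, 6.0, 3.0]
n = len(x1)
d1, d2, d3 = deviations(x1), deviations(x2), deviations(x3)

# Net regression coefficients from the two normal equations.
det = sp(d2, d2) * sp(d3, d3) - sp(d2, d3) ** 2
b12_3 = (sp(d1, d2) * sp(d3, d3) - sp(d1, d3) * sp(d2, d3)) / det
b13_2 = (sp(d1, d3) * sp(d2, d2) - sp(d1, d2) * sp(d2, d3)) / det
r_sq = (b12_3 * sp(d1, d2) + b13_2 * sp(d1, d3)) / sp(d1, d1)  # unadjusted

# Formula (53), written with variances (divisor n).
var1, var2 = sp(d1, d1) / n, sp(d2, d2) / n
by_formula = 1 - var1 * (1 - r_sq) / (var1 - 2 * b12_3 * sp(d1, d2) / n + b12_3 ** 2 * var2)

# Direct route: remove b12.3 * x2 from x1, then correlate the remainder with x3.
rest = [a - b12_3 * b for a, b in zip(d1, d2)]
direct = sp(rest, d3) ** 2 / (sp(rest, rest) * sp(d3, d3))

print(round(by_formula, 6), round(direct, 6))
```

The denominator of the formula is simply the variance of the dependent variable after the part imputed to the one factor has been subtracted, which is why no new estimates need be worked out.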


An illustration of the type of problem to which this method may be 
applied can be drawn from the field of price analysis. If $X_2$ in the case
illustrated above were an index of price level, $X_1$ the price of some commodity,
and $X_3$ and $X_4$ other factors affecting price, such as production
and storage stocks, it might be desired to determine not only how 
closely the price of the commodity was related to all the factors, in- 
cluding the price index, but also how closely it was related to the re- 
maining factors after the variations in price found to be associated 
with changes in price level were removed from it. Formula (53) would 
enable this determination to be made. 

Reliability of results from a sample. All the coefficients presented
in this chapter are subject to fluctuations of sampling just as are
simpler coefficients. A later chapter (Chapter 18) discusses the extent
of these fluctuations with various sizes of samples and gives methods 
of estimating how far the coefficients from a given random sample 
may miss the true values of the coefficient in the universe from which 
the sample was drawn. 

Summary. This chapter has shown that the accuracy of a re- 
gression equation for estimating one variable from two or more others 
may be measured by the standard error of estimate. The extent to 
which variation in the dependent variable is associated with the
variation in the several independent variables may be measured by 
the coefficient of multiple correlation, or, with respect to variance, 
by the coefficient of multiple determination. The relative importance 
of each of the independent variables may be measured (a) by the co- 
efficient of partial correlation, relative to the variation remaining after 
the effects of the other variables have first been removed, or (b) by 
the beta coefficients, which reduce the net regression coefficients to a 
comparable basis. Finally, a method is provided for measuring the 
proportion of the variation in the dependent variable which is ex- 
plainable by a group of independent variables, after eliminating from 
the dependent variable that portion of its variability which has been 
found to be associated with another independent factor. 



CHAPTER 14 


DETERMINING THE WAY ONE VARIABLE CHANGES WHEN 

TWO OR MORE OTHER VARIABLES CHANGE: (4) USING 
CURVILINEAR REGRESSIONS 

The discussion of multiple correlation to this point has been limited 
to linear relationships, that is, relations where the change in the dependent
variable accompanying changes in each independent variable was 
assumed to be of exactly the same amount, no matter how large or 
how small the independent variable became. Thus in the farm income 
example, it was assumed that each additional cow would be accom- 
panied by the same increase in income, no matter whether it was the 
first, the tenth, or the thirtieth. Similarly, each additional acre in crops 
or each additional man employed was assumed to be accompanied 
by an identical contribution to the income, no matter how large or 
how small the business already was. It is quite evident that such 
an analysis makes no provision for there being an optimum size 
of operation for given circumstances or for differences in the con- 
tributions of different numbers of units. In this particular case, it 
assumes that there is no such thing as the principle of diminishing 
returns. Such an analysis might therefore fail entirely to reveal the 
proper size of productive unit, or the number of each of the several 
elements to be employed to yield maximum returns. 

In many other types of problems for which multiple correlation 
analysis might be used, limitation of the analysis to linear relations 
would seriously restrict its value or prevent its use altogether. In 
dealing with the effect of weather upon crop yields, several variable 
weather factors are usually concerned. There may be an optimum 
point for growth, with respect to both temperature and precipitation, 
with values either above or below the optimum tending to produce 
lower yields. Linear regressions are obviously unfitted to express 
such relations. In problems such as these, and many others which 
might be enumerated, determination of the exact curvilinear relation 
between independent and dependent variable, while simultaneously 
eliminating the effect of other factors which also affect the dependent 
variable, is the most important feature in the investigation. Unless
the curve itself can be determined, the other conclusions are of little
value.

The problem in its simplest outlines may be stated as follows: 
Given a series of paired observations of the values of a dependent
variable $X_1$ and two or more independent variables $X_2$, $X_3$, $X_4$, etc.,
required to find the change in $X_1$ accompanying the changes in $X_2$, $X_3$,
and $X_4$, in turn, while holding the remaining independent factors constant,
so that for any given values of $X_2$, $X_3$, and $X_4$, etc., values may
be estimated for $X_1$, according to the regression equation

$$X_1 = a' + f_2(X_2) + f_3(X_3) + f_4(X_4) + \text{etc.} \qquad (54)$$

The expression $f_2(X_2)$ is used here simply as a perfectly general term
meaning any regular change in $X_1$ with given changes in $X_2$, whether
describable by a straight line or a curve. The equation is read "$X_1$ is a
function of $X_2$ plus a function of $X_3$," etc.

The several partial (or '^net^O regression curves may be deter- 
mined either by the use of definite mathematical expressions, one for 
each independent variable, with the constants all determined simul- 
taneously just as in linear multiple correlation; or by a method 
known as ^^successive graphic approximation,” which involves no prior 
assumptions as to the shapes of the curves. 

Multiple Regression Curves Mathematically Determined 

In using definite mathematical functions, it is necessary to express 
the curvilinear relations by simple mathematical curves of some type, 
so that the constants for the curves may be determined by methods 
similar to those already presented. If simple parabolas were used, 
involving only the first and second powers of each independent vari- 
able, equation (54) could be expressed 

$$X_1 = a + b_2 X_2 + b_2'(X_2^2) + b_3 X_3 + b_3'(X_3^2) + b_4 X_4 + b_4'(X_4^2) \tag{55}$$

However, this type of parabola is not very flexible, and in practice it 
fits but very few actual curves. If the more flexible cubic parabola 
were employed, involving the first, second, and third powers of each 
independent variable, the equation would be 

$$X_1 = a + b_2(X_2) + b_2'(X_2^2) + b_2''(X_2^3) + b_3(X_3) + b_3'(X_3^2) + b_3''(X_3^3) + b_4(X_4) + b_4'(X_4^2) + b_4''(X_4^3) \tag{56}$$

This last equation for three independent variables involves 10 constants
and increases the error in their determination accordingly, and the
clerical labor of dealing with the squared and cubed values would
be large (unless they were coded). Even then, it offers no guarantee
that the curves for each function would truly represent the real
relationship. The curves corresponding to the three functions in equation
(54) would be:

$$f_2(X_2) = b_2 X_2 + b_2'(X_2^2) + b_2''(X_2^3)$$

$$f_3(X_3) = b_3 X_3 + b_3'(X_3^2) + b_3''(X_3^3)$$

$$f_4(X_4) = b_4 X_4 + b_4'(X_4^2) + b_4''(X_4^3)$$

Whether or not these curves would actually be a good fit to the true 
functions could not be told beforehand, for the problem is not to find 
the curves expressing the relation between Xi and each of the other 
variables according to the apparent relation but according to the 
underlying relation, which may become apparent only when the differ- 
ences in Xi associated with differences in the other factors have been 
eliminated. Each of the independent factors may be correlated with 
the other independent factors to a greater or less degree. Thus in the
problem which follows, correlating X2 with X3, r = +0.07; X2 with
X4, 0.00; and X3 with X4, -0.67. The last correlation is sufficient to
tend to obscure the relations. When we make a dot chart showing the
apparent relation between X1 and X3, we cannot tell how much of the
observed differences in X1 are due to the differences in X4 associated
with the differences in X3. For that reason we cannot be sure what
type of curve would truly represent the differences in X1 with differences
in X3 after allowances had been made for these other factors.
Even though the apparent relation might indicate that a straight line 
or some type of parabola would fit, there would be no guarantee that 
this would truly represent the net functional relationship. The suc- 
cessive approximation method, which makes no rigid assumption as to 
the type of curve, is therefore to be preferred.¹
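The fitting of equation (56) can be sketched numerically. The following is only an illustration under stated assumptions: the data are synthetic, the function names `cubic_design_matrix` and `fit_cubic_regression` are hypothetical, and ordinary least squares through NumPy stands in for the hand solution of the normal equations. It shows that three independent variables with first, second, and third powers do call for 10 constants.

```python
import numpy as np

def cubic_design_matrix(X):
    """Columns: a constant, then X_j, X_j^2, X_j^3 for each independent variable."""
    X = np.asarray(X, dtype=float)
    cols = [np.ones(len(X))]
    for j in range(X.shape[1]):
        for power in (1, 2, 3):
            cols.append(X[:, j] ** power)
    return np.column_stack(cols)

def fit_cubic_regression(X, y):
    """Least-squares constants a, b2, b2', b2'', b3, ... in the order of the columns."""
    A = cubic_design_matrix(X)
    coef, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return coef

# Synthetic, noise-free data (an assumption made only for this sketch).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
true_coef = np.array([2.0, 1.0, -0.5, 0.25, 0.0, 1.5, 0.0, -1.0, 0.0, 0.3])
y = cubic_design_matrix(X) @ true_coef

coef = fit_cubic_regression(X, y)
```

A good fit of this kind says nothing about whether a cubic is the right shape for the net relation, which is the objection raised in the text.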

Multiple Regression Curves by Successive Approximations 

The general method of determining partial regression curves by the 
successive approximation method may be outlined as follows: 

The conditions to be imposed on the shape of each curve, in view
of the logical nature of the relations, are first thought through and
stated. This procedure, for each curve, is similar to that described
on page 109 of Chapter 6.

¹ The determination of multiple regression curves by fitting definite mathematical
equations is dealt with at more length in Chapter 22, on pages 306 to -lOI.





The linear partial regressions are next computed. Then the de- 
pendent variable is adjusted for the deviations from the means of all 
independent variables except one, and a correlation chart, or dot chart, 
is constructed between these adjusted values and that independent 
variable. This provides the basis for drawing in the first approxima- 
tion curve for the net regression of the dependent variable on that 
independent variable, within the limitations of the conditions stated. 
The dependent variable is then corrected for all except the next in- 
dependent variable, the corrected values plotted against the values 
of that variable, and the first approximation curve determined with 
respect to that variable. This process is carried out for each independent
variable in turn, yielding a complete set of first approximations
to the net regression curves. These curves are then used as
a basis for correcting the dependent factor for the approximate curvi- 
linear effect of all independent variables except one, leaving out each 
in turn ; and second approximation curves are determined by plotting 
these corrected values against the values of each independent variable 
in turn. New corrections are made from these curves, and the process 
is continued until no further change in the several regression curves 
is indicated. 
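The outline above can be imitated numerically. The sketch below is not the graphic method itself: a freehand curve cannot be coded, so group averages joined by straight segments are used as a stand-in, the data are synthetic, and the names `bin_mean_curve` and `successive_approximation` are hypothetical. Each round corrects the dependent variable for the current effects of all other variables, leaving out each in turn, and redraws every curve from the corrected values.

```python
import numpy as np

def bin_mean_curve(x, y, n_bins=6):
    """Stand-in for a freehand curve: group averages joined by straight segments."""
    order = np.argsort(x)
    groups = np.array_split(order, n_bins)
    gx = np.array([x[g].mean() for g in groups])
    gy = np.array([y[g].mean() for g in groups])
    return lambda new_x: np.interp(new_x, gx, gy)

def successive_approximation(X, y, n_rounds=4, n_bins=6):
    n, k = X.shape
    # Linear partial regressions give the starting net effects.
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    linear_fit = A @ coef
    effects = [coef[j + 1] * (X[:, j] - X[:, j].mean()) for j in range(k)]
    curves = [None] * k
    for _ in range(n_rounds):
        new_effects = []
        for j in range(k):
            # Correct y for every independent variable except the j-th.
            others = sum(effects[m] for m in range(k) if m != j)
            adjusted = y - y.mean() - others
            curves[j] = bin_mean_curve(X[:, j], adjusted, n_bins)
            reading = curves[j](X[:, j])
            new_effects.append(reading - reading.mean())
        effects = new_effects  # the whole set of curves is replaced together
    curve_fit = y.mean() + sum(effects)
    return curves, linear_fit, curve_fit

# Synthetic data with one strongly curved relation (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(300, 3))
y = 30.0 - 0.5 * (X[:, 0] - 5.0) ** 2 + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=300)

curves, linear_fit, curve_fit = successive_approximation(X, y)
ss_linear = float(np.sum((y - linear_fit) ** 2))
ss_curve = float(np.sum((y - curve_fit) ** 2))
```

Comparing `ss_curve` with `ss_linear` indicates how much of the variation left by the linear regressions the curves account for in this artificial case.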

The process of determining net curvilinear regressions by the suc- 
cessive graphic approximation method may be illustrated by the data 
shown in Table 50. These data show, for a period of 38 years, the average
rainfall during June, July, and August, for nine weather stations
scattered through the Corn Belt. This precipitation has been designated
as variable X3. The average temperature during the same
months, at the same stations, has been designated as X4. The average
yield of corn per acre, in the six leading Corn Belt states, is shown as
X1, the variable whose fluctuations are to be explained, so far as
possible, by the other factors.

It is evident from the table that there has been a marked upward 
trend in corn yield during this period, although there has not been a 
similar trend in rainfall or temperature. Plotting each one of the
three factors, X3, X4, and X1, as shown in Figure 32, we notice, however,
that there have been marked though irregular long-time cycles
in rainfall and temperature during the period. To a certain extent 
the upward swing in yields has agreed with the high point of the 
rainfall cycles, particularly from 1919 to 1921. It is not safe, there- 
fore, to fit a long-time trend to yield and to assume that in removing 
that trend we are merely taking out the effects of such factors as 
better varieties, improved methods of tillage, or concentration of 
acreage in the more fertile sections. Since there is some associa- 
tion between rainfall and time, at least over considerable periods, in 
eliminating all the variation associated with time we might be 
eliminating a part of the variation which really reflected differences 
in rainfall. Accordingly we may make time itself one of the factors 
in the multiple correlation and ascribe to time only that part of 
the long-time change in yields which is not associated with differences 
in rainfall or in temperature. Each year, numbered from 0 up, is
therefore included as one of the factors in the multiple correlation²
and is designated as variable X2.

Before starting the statistical process, we must state the conditions 
to be observed in fitting a curve to each function. For rainfall, the 
considerations are quite similar to those discussed in Chapter 8 for 
irrigation water applied, so we shall use the same conditions as stated 
there (page 152). 

² Note the parallel treatment of changes in time as an independent factor in
R. A. Fisher, Statistical Methods for Research Workers, second edition, p. 174.

TABLE 50

Yield of Corn, Rainfall, and Temperature in Six Leading States; and
Yield Estimated by Linear Regressions on Three Factors

For temperature, the range of possible relations might be wider.
There may be certain temperatures to which the plant does not respond
and then certain higher temperatures which produce a marked response.
Again, if the temperature is too high, a marked reduction in yield
might be produced.³ These considerations lead to the following conditions
for the temperature curve:

1. It might rise none at all or slowly in the lower range, then
more steeply, then taper off until a maximum is reached.

2. It might decline after the maximum, gradually or sharply,
but would have only one maximum.

3. It might have two points of inflection, one where it started to
rise rapidly, the second where it started to rise less rapidly.

With respect to the third curve, that for trend, there is no a priori
reason to expect any given shape during the period concerned, except 
that there be no sudden changes from year to year. Accordingly, 
the only condition imposed is that the trend have a smooth, gradual 
change, with no sharp inflections. 
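Conditions of this kind can be checked mechanically once a curve has been read off at a series of points. The sketch below is an illustration only: the readings are hypothetical, and the helper names `has_single_maximum` and `largest_step` are assumptions, not part of the method described here. The first helper tests the "only one maximum" condition; the second reports the largest change between adjacent readings, which gives a rough check on the requirement of no sudden changes.

```python
def has_single_maximum(readings):
    """True if the readings never rise again once they have started to fall."""
    falling = False
    for prev, cur in zip(readings, readings[1:]):
        if cur < prev:
            falling = True
        elif cur > prev and falling:
            return False
    return True

def largest_step(readings):
    """Largest absolute change between adjacent readings."""
    return max(abs(b - a) for a, b in zip(readings, readings[1:]))

# Hypothetical curve readings, ordered by the independent variable.
one_peak = [30.2, 31.0, 31.8, 32.5, 32.8, 32.6, 31.6, 30.1]
two_peaks = [30.0, 32.0, 31.0, 33.0, 30.0]

ok = has_single_maximum(one_peak)
bad = has_single_maximum(two_peaks)
step = largest_step(one_peak)
```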

As a preliminary step before starting to determine the net regres- 
sion curves, we may examine the apparent relation of yield to rainfall, 
before the other factors (temperature and time) are taken into 
account. 

The apparent relation between rainfall (X3) and yield (X1) is
indicated in Figure 33, by a dot chart of the relation, with the average 
yield indicated for each group of years of similar rainfall. The broken 
line connecting these averages indicates that there is a marked curvilin- 
ear relation, the lower increases in rainfall being accompanied by much 
greater increases in yield than the higher increases. Fitting a straight 
regression line to these two variables, the relation is found to be 

$$X_1 = 23.55 + 0.776 X_3$$

This line is accordingly drawn in on the chart, cutting across the curve 
indicated by the line of group averages. 

Although Figure 33 shows yields to be definitely associated with 
differences in rainfall, it must be noted that rainfall is significantly 
correlated with X4, temperature, the correlation being r = -0.67,
and is also slightly correlated with time. To some extent, then, the
changes in yield shown in the figure to be associated with differences in
rainfall may really be due to concomitant differences in the other two
factors. The extent to which these other two factors may have influenced
the relations can be judged by determining the multiple correlation
of X1 with all three factors, and then noting how the regression of
X1 on X3 alone (b13), which has just been shown plotted in the figure,
compares with the net regression of X1 on X3 (b13.24) determined while
simultaneously holding constant the linear effects of X2 and X4. The
first step toward determining the net regression curve, therefore, is
to determine the multiple regression equation and the coefficient of
multiple correlation, according to the methods outlined in Chapters
12 and 13.

³ More elaborate investigations, experimental and statistical, have shown that
the effects of both temperature and rainfall vary at different times of the season,
and especially at certain critical times in the growth of the plant, such as at
tasseling. Also, the particular combination of moisture and heat may be important.
These possibilities will be referred to subsequently, in connection with more refined
and elaborate methods of analysis.

Fig. 33. Apparent relation of corn yields to rainfall (with simple and net regression
lines). [Axes: corn yield, in bushels, against X3, rainfall, in inches.]

The regression equation works out to be

$$X_1 = 53.505 + 0.146 X_2 + 0.537 X_3 - 0.405 X_4$$

and the multiple correlation, adjusted for the number of observations
and constants, 0.49.⁴

⁴ Using units of years of time, inches of rainfall in tenths, and degrees of temperature
in tenths, and corn yields in tenth bushels, we find the normal equations
for the data of Table 50 to be:

$$\begin{aligned}
4{,}569.50\,b_{12.34} + 248.00\,b_{13.24} - 8.50\,b_{14.23} &= 6{,}813.00 \\
248.00\,b_{12.34} + 18{,}989.06\,b_{13.24} - 10{,}279.41\,b_{14.23} &= 14{,}726.97 \\
-8.50\,b_{12.34} - 10{,}279.41\,b_{13.24} + 12{,}408.86\,b_{14.23} &= -8{,}442.64
\end{aligned}$$

$n\sigma_1^2 = 70{,}456.03$; $\sigma_1 = 43.0$, or 4.3 bushels.







This result shows that when the net linear influence of trend and 
of temperature is allowed for, yield increases on the average only 
0.54 bushel for each increase of one inch in rainfall, whereas, before 
these other factors were taken into account, yield appeared to increase 
0.78 bushel with each additional inch of rainfall. The difference 
between the simple regression and the net regression may be shown 
by plotting the latter as well in Figure 33.⁵ It is then quite apparent
how different are the relations as shown by the two lines. 
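The reduction of the multiple regression equation to a single net regression line for rainfall is a matter of arithmetic, sketched below. The coefficients and the mean for time are those given in the text; the temperature mean is taken as 74.276, the figure used in the later computation for Figure 35, and the names in the block are hypothetical.

```python
# Constants of the linear multiple regression equation, as given in the text.
A, B2, B3, B4 = 53.505, 0.146, 0.537, -0.405
M2 = 18.500   # mean of X2 (time)
M4 = 74.276   # mean of X4 (temperature), as used for Figure 35

def net_line_on_rainfall(x3):
    """Estimated yield for rainfall x3 with time and temperature at their means."""
    return A + B2 * M2 + B3 * x3 + B4 * M4

net_intercept = net_line_on_rainfall(0.0)

# Proportionate fall in the rainfall coefficient when the other factors are allowed for.
simple_slope, net_slope = 0.776, B3
reduction = (simple_slope - net_slope) / simple_slope
```

Any two rainfall values passed to `net_line_on_rainfall` give the two points needed to rule in the line on the chart.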

Considering the effect of the other factors reduces the linear regression
of X1 on X3 by nearly one-third. If other factors have so much effect
on the average linear relation, they may have an even greater effect on
the shape of the curve. The net regression line in Figure 33 shows the
average change in the values of X1 with different values of X3, after
the differences in X2 and X4 are taken into account. The average yield
for different groups according to rainfall, connected by the broken line,
shows definitely that the simple regression line is but a poor indication
of the underlying relation between X1 and X3. The net (or partial)
regression line may be an equally poor indication of the relation with
the other factors held constant. What is needed is some way of seeing
the differences in the individual values of X1 for different values of X3,
after the variation due to X2 and X4 has been eliminated. It is impossible
to do this entirely, for we have as yet no measure of the curvilinear
relation of X1 to X2 or X4. But we do have our net regression
coefficients, which measure the linear regression of X1 on these other
factors, and by using them we can eliminate from X1 that part of its
variation associated with the linear effects of X2 and X4, and then see
if that gives us any clearer picture of the curvilinear relation between
X1 and X3.

Determining the “first approximation” net regression curves. 

Having determined the linear multiple regression equation, we next
calculate the estimated value of X1 for each one of the 38 observations,
by substituting the corresponding values of X2, X3, and X4 in the
equation. Each of the estimated values (X1') is then subtracted from
the actual value (X1), giving the residual values (z), as also shown in
Table 50.

⁵ The net regression line, showing the change in yield with changes in rainfall
while holding constant time and temperature, may be computed from the multiple
regression equation by substituting the average values for time and for temperature
for X2 and X4, and then working out the new constant. For the data given in Table
50, the averages are:

M2 = 18.500; M3 = 10.784; M4 = 74.276; M1 = 31.916

If we substitute the means of X2 and X4 for their values in the multiple regression
equation, that equation becomes:

$$X_1 = 53.505 + (0.146)(18.500) + 0.537 X_3 - (0.405)(74.276) = 26.124 + 0.537 X_3$$

The net regression line in Figure 33 is therefore drawn in from this last equation.

The next step is to construct a scatter diagram to show the relation
between variations in X3 and the variation in X1 after that associated
with X2 and X4 has been eliminated. To do that, the net regression
line for X1 on X3 is plotted on Figure 34, just as it had been on
Figure 33.⁶

The residuals for each observation, from Table 50, are then plotted
on the chart, with their X3 value for abscissa and with the value of z as
ordinate from the net regression line as zero base. For the first observation,
X3 = 9.6 and z = -3.9. The ordinate of the point on the
net regression line corresponding to X3 = 9.6 is 31.3, and the dot for
this observation is correspondingly plotted 3.9 lower than that, at
27.4. For the second observation, X3 = 12.9 and z = +2.1. The ordinate
of the point on the regression line corresponding to X3 = 12.9 is
33.1; so the dot for this observation is plotted at 33.1 + 2.1, or 35.2.
After the corresponding operation has been carried out for all the
observations, the figure appears as shown in Figure 34.⁷
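The placing of each dot can be written as a one-line computation: the ordinate of the net regression line at the observed rainfall, plus the residual. The sketch below uses the net regression line 26.124 + 0.537 X3 and the two observations worked through in the text; the function names are hypothetical.

```python
def net_line(x3, intercept=26.124, slope=0.537):
    """Ordinate of the net regression line of yield on rainfall."""
    return intercept + slope * x3

def plotted_ordinate(x3, z):
    """Height at which the dot is placed: the line ordinate plus the residual z."""
    return net_line(x3) + z

first = plotted_ordinate(9.6, -3.9)    # first observation
second = plotted_ordinate(12.9, 2.1)   # second observation
```

Rounded to one decimal, these agree with the 27.4 and 35.2 obtained in the text by reading the line to the nearest tenth first.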

If Figure 34 is compared with Figure 33, it is readily seen that the
scatter of the dots has been reduced. This will always be true when
the other variables show any significant relation to the dependent
factor; that is, when R1.234 exceeds r13. The scatter is reduced because
that part of the variation in X1 which can be expressed as net linear
functions of X2 and X4 has now been eliminated.⁸

⁶ To plot the line, all that is necessary is to take the equation of the line to be
used (see previous footnote)

$$X_1 = 26.124 + 0.537 X_3$$

and substitute any two convenient values for X3, say 6 and 16.

For X3 = 6, X1 = 26.124 + (0.537)(6) = 29.35
For X3 = 16, X1 = 26.124 + (0.537)(16) = 34.71

With these two sets of coordinates, the line is then drawn in with a straight edge
through the points indicated.

⁷ The simplest way of plotting the individual observations is to use a scale,
which can be slid along the regression line as zero. The values of z are then plotted
directly as vertical deviations from the points on the regression line corresponding
to the particular values of the independent variable considered, as X3 in the present
case.

Consideration of Figure 34 can be facilitated by computing the
means of the ordinates corresponding to the values of X3 falling within
convenient intervals. These can be obtained by simply averaging
together the z values for each selected group of values of X3 and
plotting those averages as deviations from the regression line, just as
the individual deviations were plotted previously. The necessary
averages are as shown in Table 51.


TABLE 51

Average Values of z, for Corresponding X3 Values

X3 values        Number of cases    Average of X3    Average of z
Under 8.0               4                7.30            -3.85
8.0- 9.9               10                9.19            +0.16
10.0-10.9               8                                +1.49
11.0-11.9               5               11.40            +2.56
12.0-13.9               8               12.76            -0.52
14.0 and over           3               15.60            -2.20
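The grouping and averaging behind a table such as Table 51 can be sketched as follows. Since the individual observations of Table 50 are not reproduced here, the six observations in the example are hypothetical; only the class limits follow Table 51. The function name `group_averages` is likewise an assumption.

```python
from bisect import bisect_right

def group_averages(x_values, z_values, cuts):
    """For each class, return (number of cases, average x, average z).

    `cuts` holds the lower limits of every class after the first, so an
    observation equal to a limit falls in the class that begins there.
    """
    n_groups = len(cuts) + 1
    counts = [0] * n_groups
    sum_x = [0.0] * n_groups
    sum_z = [0.0] * n_groups
    for x, z in zip(x_values, z_values):
        g = bisect_right(cuts, x)
        counts[g] += 1
        sum_x[g] += x
        sum_z[g] += z
    return [
        (counts[g], sum_x[g] / counts[g], sum_z[g] / counts[g]) if counts[g] else (0, None, None)
        for g in range(n_groups)
    ]

# Class limits as in Table 51; the observations themselves are hypothetical.
cuts = [8.0, 10.0, 11.0, 12.0, 14.0]
x3 = [7.0, 7.6, 8.0, 9.8, 10.5, 14.2]
z = [-4.0, -3.0, 0.5, -0.1, 1.5, -2.0]

rows = group_averages(x3, z, cuts)
```

Each row then supplies the abscissa (average X3) and the deviation from the net regression line (average z) for one point on the broken line of averages.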


These averages, when plotted the same as the individual observations
and connected by a broken line, give the irregular line also shown
in Figure 34. Comparing this line with the similar one in Figure 33,
on page 227, we see that though the lines are in general similar there
are some marked differences. The average for the second group (X3 =
8.0-9.9) is now above the straight net regression line, whereas previously
it was below it. Likewise the average for X3 = 14 and over is
now slightly below the average for X3 = 12.0 to 13.9, whereas before
it was a little above it. Also, the difference between the first two
averages is not so large as it appeared before. Apparently part of the
previous deviations reflected other independent factors.

⁸ This can be readily proved. Each point on the net regression line was obtained
by the formula:

$$X_1 = a_{1.234} + b_{12.34} M_2 + b_{13.24} X_3 + b_{14.23} M_4 \tag{A}$$

To these values have been added the residuals, z. These residuals equal X1 - X1',
and therefore for each observation are equal to

$$X_1 - a_{1.234} - b_{12.34} X_2 - b_{13.24} X_3 - b_{14.23} X_4 \tag{B}$$

The ordinate of each dot in Figure 34 is the ordinate of the regression line plus z,
and is therefore equal to the sum of the two equations, (A) and (B). If we use π to
represent these ordinates, they are therefore equal to

$$\begin{aligned}
\pi &= a_{1.234} + b_{12.34} M_2 + b_{13.24} X_3 + b_{14.23} M_4 + X_1 - a_{1.234} - b_{12.34} X_2 - b_{13.24} X_3 - b_{14.23} X_4 \\
\pi &= X_1 - b_{12.34}(X_2 - M_2) - b_{14.23}(X_4 - M_4) \\
\pi &= X_1 - b_{12.34}\,x_2 - b_{14.23}\,x_4
\end{aligned}$$

The adjusted values shown on Figure 34 are therefore simply the values of X1
less net linear corrections for deviations in X2 and X4 from their mean values.
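The footnote's identity, that the ordinate of the net regression line plus the residual equals X1 less the net linear corrections for X2 and X4, can be verified numerically. The sketch below does so on synthetic data, which are an assumption made purely for the check; the variable names are hypothetical and NumPy least squares stands in for the hand computation.

```python
import numpy as np

# Synthetic observations; the columns of X play the parts of X2, X3, and X4.
rng = np.random.default_rng(1)
n = 38
X = rng.normal(size=(n, 3))
y = 30.0 + X @ np.array([0.15, 0.5, -0.4]) + rng.normal(scale=2.0, size=n)

# Linear multiple regression of y on the three factors.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b2, b3, b4 = coef
M2, M4 = X[:, 0].mean(), X[:, 2].mean()

z = y - A @ coef                                 # residuals from the linear equation
line = a + b2 * M2 + b3 * X[:, 1] + b4 * M4      # net regression line, formula (A)
plotted = line + z                               # ordinates of the dots

# Direct adjustment of y for deviations of X2 and X4 from their means.
adjusted = y - b2 * (X[:, 0] - M2) - b4 * (X[:, 2] - M4)
```

The two arrays `plotted` and `adjusted` agree to rounding error, whatever data are used, because the identity is purely algebraic.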



Fig. 34. Rainfall and yield of corn adjusted to average temperature and year, and
first approximation curve fitted to the averages. [The notation f(X3) on the
figure corresponds to f3(X3).]


It is quite evident that a regression curve is indicated, rising sharply 
to a maximum yield between 10 and 12 inches of rain, then declining 
gradually for higher rainfalls. Such a curve is accordingly drawn in 
freehand, passing as near to the several group averages as is consistent 
with a continuous smooth curve, and yet conforming to the limiting 
conditions as to its shape. This curve is the first approximation
to the curvilinear function

$$X_1 = f_3(X_3)$$

which was required to be determined while simultaneously taking into
account the curvilinear effects of X2 and X4 on X1. It is only a first
approximation because it has been determined while allowing for only
the net linear effects of the other two variables. If their curvilinear
effect were determined and allowed for, that might change somewhat
the shape of this curve.

The next step is to determine similar first approximations to the
curvilinear relation between X1 and X2, and between X1 and X4, with
the net linear effects of the other variables eliminated just as has been
done for X3. It is not necessary to plot the apparent relation between
X1 and X2 or X1 and X4. This was done in the case of X3 (Figure 33)
solely to illustrate the difference between taking the apparent relations
and taking the net relations after the linear influence of the other factors
had been allowed for (Figure 34). Instead, we may proceed at once
to determine the net relations for X1 to X2. Figure 35 shows this step.



Fig. 35. Time and yield of corn adjusted to average temperature and rainfall, and
first approximation curve fitted to the averages. [The notation f(X2) on the
figure corresponds to f2(X2).]

This figure is constructed exactly as was Figure 34, by the following
steps: (1) Plot the net regression line.⁹ (2) Plot in the individual
residuals, z, as deviations from that line.¹⁰ (3) Average the residuals
grouped according to X2, plot the group averages, and connect them by
a broken line. (4) Draw in a smooth curve through the line of averages,
if a curve is indicated, conforming to the limiting conditions
stated for this curve.

⁹ The regression equation, for mean values of X3 and X4, becomes

$$\begin{aligned}
X_1 &= 53.505 + 0.146 X_2 + 0.537(M_3) - 0.405(M_4) \\
    &= 53.505 + 0.146 X_2 + (0.537)(10.784) - (0.405)(74.276) \\
    &= 29.214 + 0.146 X_2
\end{aligned}$$

This equation is then the equation to which the net regression line in Figure 35
is drawn. Substituting the values X2 = 0 and X2 = 20 in the equation, values for
X1 of 29.214 and 32.13 are obtained, giving the coordinate points for drawing in
the line.

¹⁰ For the first observation, X2 = 0 and z = -3.9. The point on the regression
line corresponding to X2 = 0 has an ordinate of 29.2. The dot for this observation
is accordingly plotted at 29.2 - 3.9, or 25.3. For the next observation, X2 = 1 and
z = 2.1. The corresponding ordinate on the regression line is 29.4, so the dot is
plotted at 29.4 + 2.1, or 31.5. The dot for each observation is plotted in turn in
the same way, with a sliding graphic scale to place the dots above or below the
regression line.

After the first two steps have been carried out, just as described for
Figure 34, grouping and averaging the residuals with respect to X2
give the averages shown in Table 52.

TABLE 52

Average Values of z for Corresponding X2 Values

X2 values        Number of cases    Average of X2    Average of z
 0- 7                   8                3.5             -0.38
 8-15                   8               11.5             +0.24
16-23                   8               19.5             +0.64
24-31                   8               27.5             +0.26
32-37                   6               34.5             -1.00
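A simple check on a table of this kind follows from the fact that the residuals from a least-squares regression with a constant term sum to zero: the group averages, weighted by their numbers of cases, should therefore also come out at very nearly zero, apart from rounding. The sketch below applies the check to the figures of Table 52; the variable names are hypothetical.

```python
# (number of cases, average of z) for each class of X2 in Table 52.
table_52 = [(8, -0.38), (8, 0.24), (8, 0.64), (8, 0.26), (6, -1.00)]

total_cases = sum(n for n, _ in table_52)
weighted_sum = sum(n * z for n, z in table_52)
mean_residual = weighted_sum / total_cases
```

A weighted sum far from zero would point to an error in grouping or in the arithmetic of the averages.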


The average residuals shown in the table are then plotted in above 
and below the regression line in Figure 35 and connected by a broken 
line. This line of averages indicates that corn yield (for years of similar 
rainfall and temperature) rose rapidly during the earlier years, then 
more and more gradually, until during the last ten years it tended to 
remain about on the same level. A smooth continuous curve is therefore
drawn through the averages, completing step (4) and giving the
first approximation to the curvilinear net regression of X1 on X2,
f2(X2).

The same operations are then carried out for X4, as shown in Figure
36. After drawing in the net regression line,¹¹ and plotting in the
individual observations,¹² we group the residuals on X4 and average,
with the results shown in Table 53.

¹¹ The net regression line for X1 and X4 may be determined by an alternative
method to that used before. On such charts as Figures 34, 35 or 36, the net regression
line will always pass through the mean of the two variables. For Figure 36,
therefore, X1 will have its mean value, 31.92, when X4 has its mean value, 74.28.
From the net regression coefficient, b14.23, it is evident that each unit increase in
X4 is accompanied by -0.405 unit increase in X1. If X4 is increased from 74.28
to 78.28, or 4 units, X1 will change by (-0.405)(4), or -1.62. For X4 = 78.28, X1
will therefore be 31.92 - 1.62, or 30.30. This gives the two sets of points necessary
to locate the line: when X4 = 74.28, X1 = 31.92; and when X4 = 78.28, X1 = 30.30.

¹² The individual residuals are plotted in the same way as indicated in the other
two cases; the residual -3.9 for X4 = 74.8 is plotted 3.9 units below the corresponding
point on the regression line, and similarly for the other observations.






TABLE 53

Average Values of z for Corresponding X4 Values

X4 values        Number of cases    Average of X4    Average of z
Under 72.0              4               71.08            -1.28
72.0-72.9               5               72.58            -1.24
73.0-73.9               6               73.36            +1.46
74.0-74.9              10               74.30            +0.49
75.0-75.9               7               75.33            +0.91
76.0-76.9               5               76.44            +0.64
77.0 and over           2               78.00            -5.20
76.0 and over           7               76.89            -1.03


The last group, on the first grouping, has but two cases, so the last
two groups are combined, giving the averages shown in the last line.
The fact that both the items above 77 degrees are low, also evident in
Figure 36, would give a little more reliability to the average based
on only two items; but it is generally unsafe to give such an extreme
bend to the end of a regression curve as this would call for, on the
basis of so few observations. The larger grouping will therefore be
used in this case, leaving the subsequent approximations to determine
whether the more extreme bend is justified.

Fig. 36. Temperature and yield of corn adjusted to average rainfall and year, and
first approximation curve fitted to the averages. [The notation f(X4) on the
figure corresponds to f4(X4).]


The line of averages in Figure 36 indicates that yields may tend 
to rise as temperature increases up to between 73 and 75 degrees, and 
then to fall as the temperature goes still higher. A smooth curve is 
therefore drawn in, averaging out the irregularities shown in the 
broken line of the group averages and conforming to the limiting 
conditions stated on page 226. It does not make much difference if 
these first approximation curves are not drawn in in exactly the 
right position or shape, as the subsequent operations will tend to 
correct them to the proper shape if the original one is incorrect. It 
is for that reason that fairly accurate results can be secured by this 
graphic process, even though the true shape of the curves is not known 
at the beginning. 

Estimating X1 from the first approximation curves. We have now
arrived at first approximations to the net regression curves for X1,
against each of the three factors. It must be remembered that in 
making the adjustments on Xi to arrive at these curves, only the net 
linear effects of the other independent variables have been eliminated. 
Now that we have at least an approximate measure of the curvilinear 
relations of Xi to the independent variables, making adjustments to 
eliminate these approximate curvilinear effects may enable us to 
determine more accurately the true curvilinear relation to each variable. 

The first step in the next stage of the process is to work out estimated
values of X1 based on the curvilinear relations. To do this we may
designate the relation between X1 and X2 shown by the curve in Figure
35 as f2(X2); the relation between X1 and X3 shown in Figure 34 as
f3(X3); and the relation between X1 and X4 shown in Figure 36 as
f4(X4). The estimates of X1 may then be worked out by the regression
equation

$$X_1'' = a'_{1.234} + f_2(X_2) + f_3(X_3) + f_4(X_4) \tag{57}$$

The symbol X1'' is used to designate this second set of estimates, just
as X1' was used to designate the first set, worked out from the linear
regression equation. The constant a'1.234 is different from the constant
a1.234 used in equation (36); its value is given by the formula

$$a'_{1.234} = M_1 - \frac{\sum\left[f_2(X_2) + f_3(X_3) + f_4(X_4)\right]}{n} \tag{58}$$

To work out a'1.234 according to equation (58), it is first necessary to
work out the value f2(X2) + f3(X3) + f4(X4) for each set of observations.
For the first observation, for example, X2 = 0, X3 = 9.6, and X4 = 74.8.
From f2(X2), given in Figure 35, the curve reading (or ordinate) corresponding
to a value of 0 for X2 is 27.3. For f3(X3), Figure 34, the
ordinate of the curve corresponding to X3 = 9.6 is 31.7. For f4(X4),
Figure 36, the curve ordinate corresponding to X4 = 74.8 is 32.5.
The value [f2(X2) + f3(X3) + f4(X4)] for the first observation is therefore
[27.3 + 31.7 + 32.5], or 91.5. The sum of these values for each
observation is the value required in equation (58).
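The working of equation (58) can be sketched in a few lines. Only the three curve readings for the first observation, quoted in the text, are entered; the readings for the remaining observations would be added in the same way. The dictionary and function names are hypothetical, and the constant is taken as the mean of X1 less the average, over all observations, of the summed curve readings.

```python
# Curve readings recorded once per observed value (first observation only).
f2 = {0: 27.3}       # reading from Figure 35 at X2 = 0
f3 = {9.6: 31.7}     # reading from Figure 34 at X3 = 9.6
f4 = {74.8: 32.5}    # reading from Figure 36 at X4 = 74.8

def curve_sum(x2, x3, x4):
    """f2(X2) + f3(X3) + f4(X4) for one observation."""
    return f2[x2] + f3[x3] + f4[x4]

def adjusted_constant(mean_x1, sums):
    """Mean of X1 less the average of the summed curve readings."""
    return mean_x1 - sum(sums) / len(sums)

first_sum = curve_sum(0, 9.6, 74.8)
```

With the sums for all 38 observations collected in a list, `adjusted_constant(M1, sums)` would give the constant for equation (57).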

Before continuing the process of reading each value from the
charts for the remaining observations, it should be noted that, since
many observations of each variable have the same values, the same
point would be read from each chart many times. The process of
working out the computations can be much simplified by reading each
required value from each chart once for all and recording it so that
it can be used each time. Since each chart indicates each individual
observation for each independent variable, only those points for which
there are observations need be recorded. Carrying out this process,
we may record the functional relations as shown in Tables 54, 55, and
56, which show the readings from Figures 35, 34, and 36, respectively.¹³

TABLE 54

Values of X1 Corresponding to Given Values of X2, from the First
Approximation Curve

 X2    f2(X2)      X2    f2(X2)      X2    f2(X2)      X2    f2(X2)
  0     27.3       10     30.8       20     32.8       29     33.4
  1     27.8       11     31.0       21     33.0       30     33.5
  2     28.2       12     31.3       22     33.1       31     33.5
  3     28.6       13     31.5       23     33.1       32     33.5
  4     29.0       14     31.7       24     33.2       33     33.5
  5     29.4       15     31.9       25     33.2       34     33.5
  6     29.7       16     32.1       26     33.3       35     33.5
  7     30.0       17     32.3       27     33.3       36     33.5
  8     30.3       18     32.5       28     33.4       37     33.5
  9     30.6       19     32.6


In entering these values it is not worth while reading further than t he first 
decimal, for the line is not drawn more accurately than to within 0.1 or 0.2. The 
accuracy depends, of course, on the scale ; but it is not worth using very largo charts 
to secure spuriously high accuracy, when the standard error of any particular point 
on the curve is probably several units and when the curve is only a first approxi- 
mation, subject to subsequent modification. 




The values to determine $a'_{1.234}$ may now be worked out in orderly
manner, as shown in Table 57, in the fourth to the seventh columns.

TABLE 55

Values of $X_1$ Corresponding to Given Values of $X_3$, from the First
Approximation Curve

  X_3  f'_3(X_3)  |   X_3  f'_3(X_3)  |   X_3  f'_3(X_3)  |   X_3  f'_3(X_3)
  6.8     24.6    |   9.5     31.5    |  10.8     33.4    |  12.9     33.3
  6.9     25.0    |   9.6     31.7    |  11.0     33.5    |  13.0     33.2
  7.7     27.1    |   9.9     32.4    |  11.3     33.6    |  13.6     32.9
  7.8     27.4    |  10.0     32.5    |  11.5     33.7    |  13.9     32.7
  8.0     27.9    |  10.1     32.6    |  11.6     33.7    |  14.1     32.5
  8.7     29.7    |  10.4     33.1    |  12.0     33.7    |  16.2     31.0
  9.3     31.0    |  10.6     33.3    |  12.1     33.6    |  16.5     30.8
  9.4     31.2    |  10.7     33.4    |  12.5     33.5    |




TABLE 56

Values of $X_1$ Corresponding to Given Values of $X_4$, from the First
Approximation Curve

  X_4  f'_4(X_4)  |   X_4  f'_4(X_4)  |   X_4  f'_4(X_4)  |   X_4  f'_4(X_4)
 69.9     30.2    |  73.0     32.5    |  74.2     32.8    |  75.7     31.6
 71.0     31.0    |  73.2     32.6    |  74.3     32.7    |  75.8     31.5
 71.5     31.4    |  73.3     32.6    |  74.6     32.6    |  76.0     31.3
 71.9     31.7    |  73.6     32.7    |  74.8     32.5    |  76.2     31.0
 72.0     31.8    |  73.7     32.7    |  75.0     32.3    |  76.9     30.1
 72.6     32.2    |  74.0     32.8    |  75.2     32.1    |  77.6     29.0
 72.8     32.3    |  74.1     32.8    |  75.3     32.0    |  78.4     27.6
 72.9     32.4    |                   |                   |


This computation gives us the sum of the respective functional
values for the 38 observations. Substituting this sum and the number
of observations in equation (58), we find the required constant to be

$$a'_{1.234} = 31.916 - \frac{3{,}621.9}{38} = -63.397$$

Since the functional values for our regression equation are only expressed
to one decimal point, we shall use -63.4 for $a'_{1.234}$, which will result in
the estimated values being 0.003 unit too low, on the average.
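The arithmetic of equation (58) can be checked in a few lines. The following is a minimal sketch in Python, not part of the original computation; the function name `regression_constant` is an illustrative assumption, and the figures are those quoted in the text (the mean of $X_1$, the total of the summed curve readings from Table 57, and the 38 observations).

```python
def regression_constant(mean_dependent, total_of_readings, n_observations):
    """Equation (58): mean of X1 less the mean of the summed curve readings."""
    return mean_dependent - total_of_readings / n_observations


# Figures quoted in the text and in the totals row of Table 57.
a_exact = regression_constant(31.916, 3621.9, 38)

# The curves are read to one decimal only, so the constant is rounded too.
a_used = round(a_exact, 1)

# Rounding the constant shifts every estimate by the same small amount.
shift = a_used - a_exact

print(f"exact constant    : {a_exact:.3f}")
print(f"constant used     : {a_used:.1f}")
print(f"shift in estimates: {shift:.3f}")
```

The negative shift is the small downward bias in the estimates that the text mentions.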





It is now possible to complete the process of computing $X''_1$, the
estimated value of $X_1$, using the first approximation curves, according


TABLE 57

Computation of Functional Values Corresponding to Independent Variables,
of the Estimated Value of $X_1$, and the New Residual, for Each Observation

 (1)    (2)    (3)     (4)        (5)        (6)       (7)       (8)       (9)    (10)
 X_2    X_3    X_4  f'_2(X_2)  f'_3(X_3)  f'_4(X_4)  Sum of    (7) + a'    X_1   X_1 - X''_1
                                                     (4)-(6)   = X''_1           = z''

   0    9.6   74.8    27.3       31.7       32.5      91.5      28.1      24.5    -3.6
   1   12.9   71.5    27.8       33.3       31.4      92.5      29.1      33.7     4.6
   2    9.9   74.2    28.2       32.4       32.8      93.4      30.0      27.9    -2.1
   3    8.7   74.3    28.6       29.7       32.7      91.0      27.6      27.5    -0.1
   4    6.8   75.8    29.0       24.6       31.5      85.1      21.7      21.7     0
   5   12.5   74.1    29.4       33.5       32.8      95.7      32.3      31.9    -0.4
   6   13.0   74.1    29.7       33.2       32.8      95.7      32.3      36.8     4.5
   7   10.1   74.0    30.0       32.6       32.8      95.4      32.0      29.9    -2.1
   8   10.1   75.0    30.3       32.6       32.3      95.2      31.8      30.2    -1.6
   9   10.1   75.2    30.6       32.6       32.1      95.3      31.9      32.0     0.1
  10   10.8   75.7    30.8       33.4       31.6      95.8      32.4      34.0     1.6
  11    7.8   78.4    31.0       27.4       27.6      86.0      22.6      19.4    -3.2
  12   16.2   72.6    31.3       31.0       32.2      94.5      31.1      36.0     4.9
  13   14.1   72.0    31.5       32.5       31.8      95.8      32.4      30.2    -2.2
  14   10.6   71.9    31.7       33.3       31.7      96.7      33.3      32.4    -0.9
  15   10.0   74.0    31.9       32.5       32.8      97.2      33.8      36.4     2.6
  16   11.5   73.7    32.1       33.7       32.7      98.5      35.1      36.9     1.8
  17   13.6   73.0    32.3       32.9       32.5      97.7      34.3      31.5    -2.8
  18   12.1   73.3    32.5       33.6       32.6      98.7      35.3      30.5    -4.8
  19   12.0   74.6    32.6       33.7       32.6      98.9      35.5      32.3    -3.2
  20    9.3   73.6    32.8       31.0       32.7      96.5      33.1      34.9     1.8
  21    7.7   76.2    33.0       27.1       31.0      91.1      27.7      30.1     2.4
  22   11.0   73.2    33.1       33.5       32.6      99.2      35.8      36.9     1.1
  23    6.9   77.6    33.1       25.0       29.0      87.1      23.7      26.8     3.1
  24    9.5   76.9    33.2       31.5       30.1      94.8      31.4      30.5    -0.9
  25   16.5   69.9    33.2       30.8       30.2      94.2      30.8      33.3     2.5
  26    9.3   75.3    33.3       31.0       32.0      96.3      32.9      29.7    -3.2
  27    9.4   72.8    33.3       31.2       32.3      96.8      33.4      35.0     1.6
  28    8.7   76.2    33.4       29.7       31.0      94.1      30.7      29.9    -0.8
  29    9.5   76.0    33.4       31.5       31.3      96.2      32.8      35.2     2.4
  30   11.6   72.9    33.5       33.7       32.4      99.6      36.2      38.3     2.1
  31   12.1   76.9    33.5       33.6       30.1      97.2      33.8      35.2     1.4
  32    8.0   75.0    33.5       27.9       32.3      93.7      30.3      35.5     5.2
  33   10.7   74.8    33.5       33.4       32.5      99.4      36.0      36.7     0.7
  34   13.9   72.6    33.5       32.7       32.2      98.4      35.0      26.8    -8.2
  35   11.3   75.3    33.5       33.6       32.0      99.1      35.7      38.0     2.3
  36   11.6   74.1    33.5       33.7       32.8     100.0      36.6      31.7    -4.9
  37   10.4   71.0    33.5       33.1       31.0      97.6      34.2      32.6    -1.6

 Totals            1,208.4    1,204.2    1,209.3   3,621.9





to equation (57), and the constant which has just been computed.
When equations (57) and (58) are compared, it is evident that, except
for the constant term, $X''_1$ is equal to the values that have just been
computed in the seventh column of Table 57. Accordingly, all that is
necessary is to subtract 63.4 from each of those values. This step is
shown also in Table 57, in the eighth column.

The column headed $X''_1$ shows the estimated values obtained by
this process. The next step is to see whether the new estimates come
any nearer to reproducing the observed values of $X_1$ than did the first
set of estimates, based on the linear regression equation. We therefore
compute a new set of residuals, $z''$, by subtracting the new estimates
from the actual values of $X_1$. This step, also, is shown in Table 57.

$$z'' = X_1 - X''_1 \tag{59}$$
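The two steps just described, forming the estimate from the curve readings and then taking the residual, can be sketched in Python for a couple of observations. This is an illustrative sketch only: the function and variable names are assumptions, the estimate is formed as the sum of the three readings plus the constant (as equation (57) is described in the text), and the readings are those listed in Tables 54, 55, and 56.

```python
# First-approximation curve readings, keyed by the value of each variable.
F2 = {1: 27.8, 2: 28.2}          # readings for X2 (time)
F3 = {12.9: 33.3, 9.9: 32.4}     # readings for X3 (rainfall)
F4 = {71.5: 31.4, 74.2: 32.8}    # readings for X4 (temperature)
CONSTANT = -63.4                 # the constant found from equation (58)


def estimate(x2, x3, x4):
    """Sum of the three curve readings plus the constant."""
    return F2[x2] + F3[x3] + F4[x4] + CONSTANT


def residual(x1, x2, x3, x4):
    """Equation (59): actual value less the estimate from the curves."""
    return x1 - estimate(x2, x3, x4)


# (X1, X2, X3, X4) for the second and third observations of Table 57.
observations = [(33.7, 1, 12.9, 71.5), (27.9, 2, 9.9, 74.2)]

for x1, x2, x3, x4 in observations:
    est = round(estimate(x2, x3, x4), 1)
    res = round(residual(x1, x2, x3, x4), 1)
    print(f"X2 = {x2}: estimate {est}, residual {res}")
```

Looking the readings up in small tables mirrors the hand procedure of reading each chart once and reusing the recorded value.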

If the individual residuals shown are compared with the residuals 
obtained by the linear regression, as computed in Table 50, it will be 
seen that in general the new residuals are smaller than the previous 
ones, though the reverse is true in many cases. There are 23 cases 
in which the new residual is smaller, and 15 in which it is larger 
than the original residual. A more accurate comparison can be 
obtained by comparing the standard deviations of the residuals for 
the two sets. For the linear correlation, the standard deviation of 
the residuals was 3.6 bushels, whereas the standard deviation of the
new residuals is 3.0 bushels. Apparently the new estimates do come
nearer to the observed values, on the average, than did the first set 
of estimates. (See also Note on page 258.) 

Determining the second approximation regression curves. The
regression curves used in constructing the estimate $X''_1$ were only the
first approximations to the true curvilinear relations, since they were
determined by eliminating only the linear effects of the other
independent factors. Now that the residuals obtained by the use of the
first approximation curves have been computed, however, we can
determine whether any change in the shape of the several curves is
necessary.

To do this we construct Figure 37 by drawing in the regression curve
from Figure 35, using the same scale as before. Use of Table 54 makes
it easier to reproduce the curve. Next we plot each of the last residuals
as a deviation just as before, except that now the residuals are plotted
as deviations from the regression curve, instead of from the regression
line, at the point corresponding to the independent variable. Thus
the first observation, with $X_2 = 0$, has $z'' = -3.6$. The point on the
curve corresponding to $X_2 = 0$ is 27.3; so the dot has for ordinate
27.3 - 3.6, or 23.7. The values for the next observation are $X_2 = 1$ and
$z'' = 4.6$. The corresponding value of $f'_2(X_2)$ is 27.8, so the ordinate
for the dot is 27.8 + 4.6, or 32.4. The coordinates for this dot are
therefore 1 and 32.4. The remaining observations are plotted in the
same manner, shortening the process by scaling the value for $z''$
directly above or below from the corresponding point on the regression
curve.

Fig. 37. Time, and yield of corn adjusted to average temperature and rainfall on
basis of first approximation curves; and second approximation to $f_2(X_2)$.

With the dots all plotted, it is evident that the scatter is too great
to indicate definitely changes which may be needed in the curve,

TABLE 58

Average Values of $z''$, for Corresponding $X_2$ Values

  X_2 values    Number of cases    Average of X_2    Average of z''
    0- 7               8                 3.5             +0.10
    8-15               8                11.5             +0.16
   16-23               8                19.5             -0.08
   24-31               8                27.5             +0.64
   32-37               6                34.5             -1.08


if any, simply from the dots alone. Accordingly the residuals are
averaged in groups, employing the same grouping as before (Table 52),
which eliminates the need of averaging the corresponding $X_2$ values
over again. The new averages work out as shown in Table 58.
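The group averaging can be reproduced directly from the residual column. The sketch below is an illustration in Python, with hypothetical names (`residuals`, `group_averages`); the residuals are the $z''$ values for $X_2$ = 0 through 37 in order, as listed in Table 57, and the groups are the five ranges of $X_2$ used in Table 58.

```python
# Residuals z'' for X2 = 0, 1, ..., 37, in the order of Table 57.
residuals = [
    -3.6, 4.6, -2.1, -0.1, 0.0, -0.4, 4.5, -2.1,
    -1.6, 0.1, 1.6, -3.2, 4.9, -2.2, -0.9, 2.6,
    1.8, -2.8, -4.8, -3.2, 1.8, 2.4, 1.1, 3.1,
    -0.9, 2.5, -3.2, 1.6, -0.8, 2.4, 2.1, 1.4,
    5.2, 0.7, -8.2, 2.3, -4.9, -1.6,
]

# Inclusive X2 ranges used for grouping.
groups = [(0, 7), (8, 15), (16, 23), (24, 31), (32, 37)]


def group_averages(values, bounds):
    """Return (low, high, cases, mean X2, mean residual) for each group."""
    rows = []
    for low, high in bounds:
        xs = list(range(low, high + 1))
        zs = [values[x] for x in xs]
        rows.append((low, high, len(xs), sum(xs) / len(xs), sum(zs) / len(zs)))
    return rows


for low, high, cases, mean_x, mean_z in group_averages(residuals, groups):
    print(f"{low:2d}-{high:2d}  cases={cases}  mean X2={mean_x:4.1f}  mean z''={mean_z:+.3f}")
```

Because there is exactly one observation per year, the residual for a given $X_2$ can be looked up by its position in the list.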






The averages are next plotted as deviations from the first approxi- 
mation curve. They indicate that a slight raise in the lower part of 
the curve may be needed, and a downward bend toward the end. 
It appears that, now that the influence of rainfall and temperature
on yield has been more accurately allowed for, the upward trend
with time is slightly less than it seemed before in the early years;
and the trend seems to have turned downward toward the end of the
series, though the exact year or extent of the turn is indeterminate. A new
curve is therefore drawn in on Figure 37, and, as it happens, a smooth,
continuous curve can be drawn exactly through each of the first three 
group averages, but not having the extreme bend indicated by the last 
two group averages. 

The same process may now be applied to $X_3$ to see if any change
need be made in the first regression curve for the change in $X_1$ with
changes in that variable. This process is carried out as shown in Figure
38, the first approximation curve being drawn in just as before, using
the data given in Table 55.

Instead of plotting the individual residuals for each observation, as
was just done with respect to $X_2$, we may proceed at once to compute
the average residuals for each of the groups of values of $X_3$, since it is
sufficiently apparent from Figure 37 that the scatter of the individual
observations is still too great to serve as a guide in correcting the first
approximation curves. Averaging the residuals gives the averages
shown in Table 59.


TABLE 59

Average Values of $z''$, for Corresponding $X_3$ Values

                                                          Combined groups
  X_3 values      Number of   Average of   Average of   Average of   Average of
                    cases        X_3          z''          X_3          z''
  Under 8.0           4          7.30        +0.58
   8.0- 9.9          10          9.19        +0.03
  10.0-10.9           8         10.35        -0.15  }
  11.0-11.9           5         11.40        +0.48  }      10.75        +0.09
  12.0-13.9           8         12.76        -1.11  }
  14.0 and over       3         15.60        +1.73  }      13.54        -0.34


Again the averages are somewhat irregular when plotted, so the last
four groups are reduced to two, and the new averages plotted and
indicated separately. The number of observations represented by each of
the first set of averages is indicated next to it, so that averages based
on a small number of observations will not be given undue weight in
drawing in the curve. It might be desirable in some cases, also, to try
regrouping the cases into different groups (say from 8.5 to 9.4, 9.5 to
10.4, etc.) and see if that would change at all the indications as to the
shifts needed in the first curve. Working that out in this case, the
changes needed are still found to be about the same as shown by the
group averages in Figure 38, though somewhat less regular, owing to
the smaller size of groups. A new curve is then drawn in freehand,
as indicated by the group averages, rising somewhat higher than
formerly at both ends, and not rising quite so high in the central portion
as before.

Fig. 38. Rainfall, and yield of corn adjusted to average temperature and time on
the basis of first approximation curves; and second approximation to $f_3(X_3)$.

Fig. 39. Temperature, and yield of corn adjusted to average rainfall and time on
the basis of first approximation curves; and second approximation to $f_4(X_4)$.

Turning to the relation between $X_1$ and $X_4$, the first approximation
curve for $f'_4(X_4)$ is reproduced in Figure 39, using the values given in
Table 56. The next step is to average the values of $z''$ for corresponding
values of $X_4$. Using the same groupings used in Table 53, we
arrive at the averages shown in the following table:

TABLE 60

Average Values of $z''$, for Corresponding $X_4$ Values

  X_4 values       Number of cases    Average of X_4    Average of z''
  Under 72.0              4               71.08             +1.15
  72.0-72.9               5               72.58             -0.36
  73.0-73.9               5               73.36             -0.58
  74.0-74.9              10               74.30             -0.86
  75.0-75.9               7               75.33             +0.63
  76.0-76.9               5               76.44             +0.90
  77.0 and over           2               78.00             -0.05

  76.0 and over           7               76.89             +0.63


Plotting these new averages, and connecting them by a broken line, 
we see that the relation of yield to temperature may be quite differ- 
ent from the way it appeared on the first approximation. Apparently 
the highest yields are obtained around 75 to 76 degrees, instead of at 
74 degrees; higher temperatures appear to reduce the yield markedly, 
but lower temperatures have only a slight influence on the yield. 
These indications are all within the theoretical limitations on the shape
of the curve, as stated on page 226. The new curve, drawn in freehand
so as to pass as nearly through these new averages as possible
and still maintain a smooth continuous shape, with only a single maxi- 
mum, expresses these relations. 

Estimating $X_1$ from the second approximation curves. Now that the
second approximation curves have been determined for each variable,
we can proceed to estimate values of $X_1$ on the basis of the revised
curves, to see whether the new curves enable us to estimate $X_1$ any
more accurately than the first set of curves did. To facilitate the process
we first construct tables for $f''_2(X_2)$, $f''_3(X_3)$, and $f''_4(X_4)$, showing the
readings for the functions from the revised curves.

TABLE 61

Values of $X_1$ Corresponding to Given Values of $X_2$, from the Second
Approximation Curve

  X_2  f''_2(X_2) |   X_2  f''_2(X_2) |   X_2  f''_2(X_2) |   X_2  f''_2(X_2)
    0     27.4    |    10     31.0    |    20     32.7    |    29     33.6
    1     27.9    |    11     31.2    |    21     33.0    |    30     33.5
    2     28.4    |    12     31.4    |    22     33.2    |    31     33.4
    3     28.8    |    13     31.6    |    23     33.3    |    32     33.2
    4     29.2    |    14     31.8    |    24     33.4    |    33     33.0
    5     29.5    |    15     32.0    |    25     33.5    |    34     32.8
    6     29.8    |    16     32.1    |    26     33.6    |    35     32.6
    7     30.2    |    17     32.3    |    27     33.7    |    36     32.4
    8     30.4    |    18     32.5    |    28     33.7    |    37     32.2
    9     30.7    |    19     32.6    |                   |






To simplify the calculations, 20 is subtracted from each of the 
functional values in making subsequent entries. The computations to 
determine the estimated values are then carried out as shown in detail 

TABLE 62

Values of $X_1$ Corresponding to Given Values of $X_3$, from the Second
Approximation Curve

  X_3  f''_3(X_3) |   X_3  f''_3(X_3) |   X_3  f''_3(X_3) |   X_3  f''_3(X_3)
  6.8     25.5    |   9.5     31.5    |  10.8     33.3    |  12.9     33.0
  6.9     25.7    |   9.6     31.7    |  11.0     33.4    |  13.0     33.0
  7.7     27.6    |   9.9     32.2    |  11.3     33.4    |  13.6     32.8
  7.8     27.8    |  10.0     32.3    |  11.5     33.3    |  13.9     32.7
  8.0     28.2    |  10.1     32.5    |  11.6     33.3    |  14.1     32.7
  8.7     29.9    |  10.4     32.9    |  12.0     33.2    |  16.2     32.2
  9.3     31.1    |  10.6     33.1    |  12.1     33.2    |  16.5     32.1
  9.4     31.3    |  10.7     33.2    |  12.5     33.1    |




in Table 64 following, just as for Table 57. In practical computation
these entries, for the second approximation curves, would be made on
the same sheet as were the entries in Table 57 for the first approximation
curves, thus eliminating the work of entering the values of $X_2$,
$X_3$, and $X_4$ over again.

Table 64 is worked out just as was Table 57. Thus the data for the
first observation show values of 0, 9.6, and 74.8 for $X_2$, $X_3$, and $X_4$,
respectively. Looking up the corresponding values in Tables 61, 62,
and 63 gives values of 27.4, 31.7, and 32.3, for the three functional
values. Subtracting 20 from each value, to reduce the subsequent
clerical work, we enter 7.4, 11.7, and 12.3 in the functional columns.

TABLE 63

Values of $X_1$ Corresponding to Given Values of $X_4$, from the Second
Approximation Curve

  X_4  f''_4(X_4) |   X_4  f''_4(X_4) |   X_4  f''_4(X_4) |   X_4  f''_4(X_4)
 69.9     31.6    |  73.0     32.0    |  74.2     32.2    |  75.7     32.2
 71.0     31.7    |  73.2     32.0    |  74.3     32.2    |  75.8     32.2
 71.5     31.8    |  73.3     32.0    |  74.6     32.2    |  76.0     32.1
 71.9     31.8    |  73.6     32.1    |  74.8     32.3    |  76.2     32.0
 72.0     31.8    |  73.7     32.1    |  75.0     32.3    |  76.9     30.7
 72.6     31.9    |  74.0     32.2    |  75.2     32.3    |  77.6     29.1
 72.8     32.0    |  74.1     32.2    |  75.3     32.3    |  78.4     27.3
 72.9     32.0    |                   |                   |


The three functional values are then added, and the sum entered in the
seventh column. The entries for the functional readings are completed
as shown, and the sum computed for each observation. Then the
average of the seventh column is determined, giving the value 35.30.
As the average of $X_1$ is 31.916, the value of the new constant, $a''_{1.234}$, is
found by equation (58) to be

$$a''_{1.234} = 31.916 - 35.300 = -3.384$$

Accordingly, 3.4 is subtracted from each of the values in column 7 to
give the estimated value of $X_1$, $X'''_1$, which is then entered in the eighth
column of Table 64.
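Subtracting 20 from every curve reading changes the constant but not the estimates, since the constant absorbs the shift. The following Python sketch illustrates this with the figures quoted above; the variable names are assumptions introduced for the illustration, and the unshifted total is derived here by adding back 20 for each of the three readings of each observation.

```python
N = 38                  # number of observations
MEAN_X1 = 31.916        # mean of the dependent variable
SHIFT = 20.0            # amount subtracted from each curve reading

total_shifted = 1341.4                           # total of column 7, Table 64
total_unshifted = total_shifted + 3 * SHIFT * N  # three readings per observation

# Equation (58), applied to the shifted and to the unshifted readings.
const_shifted = MEAN_X1 - total_shifted / N
const_unshifted = MEAN_X1 - total_unshifted / N

# First observation: readings 27.4, 31.7, and 32.3 from the revised curves.
readings = [27.4, 31.7, 32.3]
est_unshifted = sum(readings) + const_unshifted
est_shifted = sum(r - SHIFT for r in readings) + const_shifted

print(f"constant with shifted readings  : {const_shifted:.3f}")
print(f"constant with unshifted readings: {const_unshifted:.3f}")
print(f"estimates: {est_shifted:.3f} and {est_unshifted:.3f}")
```

The two estimates agree, which is why the clerical shortcut is harmless.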

The final step in computing the table is to subtract each of the
estimated values, $X'''_1$, from the actual value $X_1$, giving the residuals $z'''$,
which appear in the last column.

Comparing the new residuals, $z'''$, with the previous ones, $z''$, given
in Table 57, we find that the size of the residuals has been increased in
just about as many cases as it has been decreased. But when we
compute the standard deviation of the new residuals, we find that the


TABLE 64

Computation of Functional Values, from the Second Approximation Curves,
Corresponding to Independent Variables for Each Observation, and
Computation of Estimated Value for $X_1$ and of New Residuals

  Independent variables   Corresponding functional values*     Sum of   (7) + a''  Dependent  X_1 - X'''_1
  X_2    X_3    X_4     f''_2(X_2)  f''_3(X_3)  f''_4(X_4)    (4)-(6)   = X'''_1   variable     = z'''
                                                                                      X_1
  (1)    (2)    (3)        (4)         (5)         (6)          (7)       (8)        (9)        (10)

    0    9.6   74.8        7.4        11.7        12.3         31.4      28.0       24.5       -3.5
    1   12.9   71.5        7.9        13.0        11.8         32.7      29.3       33.7        4.4
    2    9.9   74.2        8.4        12.2        12.2         32.8      29.4       27.9       -1.5
    3    8.7   74.3        8.8         9.9        12.2         30.9      27.5       27.5        0
    4    6.8   75.8        9.2         5.5        12.2         26.9      23.5       21.7       -1.8
    5   12.5   74.1        9.5        13.1        12.2         34.8      31.4       31.9        0.5
    6   13.0   74.1        9.8        13.0        12.2         35.0      31.6       36.8        5.2
    7   10.1   74.0       10.2        12.5        12.2         34.9      31.5       29.9       -1.6
    8   10.1   75.0       10.4        12.5        12.3         35.2      31.8       30.2       -1.6
    9   10.1   75.2       10.7        12.5        12.3         35.5      32.1       32.0       -0.1
   10   10.8   75.7       11.0        13.3        12.2         36.5      33.1       34.0        0.9
   11    7.8   78.4       11.2         7.8         7.3         26.3      22.9       19.4       -3.5
   12   16.2   72.6       11.4        12.2        11.9         35.5      32.1       36.0        3.9
   13   14.1   72.0       11.6        12.7        11.8         36.1      32.7       30.2       -2.5
   14   10.6   71.9       11.8        13.1        11.8         36.7      33.3       32.4       -0.9
   15   10.0   74.0       12.0        12.3        12.2         36.5      33.1       36.4        3.3
   16   11.5   73.7       12.1        13.3        12.1         37.5      34.1       36.9        2.8
   17   13.6   73.0       12.3        12.8        12.0         37.1      33.7       31.5       -2.2
   18   12.1   73.3       12.5        13.2        12.0         37.7      34.3       30.5       -3.8
   19   12.0   74.6       12.6        13.2        12.2         38.0      34.6       32.3       -2.3
   20    9.3   73.6       12.7        11.1        12.1         35.9      32.5       34.9        2.4
   21    7.7   76.2       13.0         7.5        12.0         32.5      29.1       30.1        1.0
   22   11.0   73.2       13.2        13.4        12.0         38.6      35.2       36.9        1.7
   23    6.9   77.6       13.3         5.7         9.1         28.1      24.7       26.8        2.1
   24    9.5   76.9       13.4        11.5        10.7         35.6      32.2       30.5       -1.7
   25   16.5   69.9       13.5        12.1        11.6         37.2      33.8       33.3       -0.5
   26    9.3   75.3       13.6        11.1        12.3         37.0      33.6       29.7       -3.9
   27    9.4   72.8       13.7        11.3        12.0         37.0      33.6       35.0        1.4
   28    8.7   76.2       13.7         9.9        12.0         35.6      32.2       29.9       -2.3
   29    9.5   76.0       13.6        11.5        12.1         37.2      33.8       35.2        1.4
   30   11.6   72.9       13.5        13.3        12.0         38.8      35.4       38.3        2.9
   31   12.1   76.9       13.4        13.2        10.7         37.3      33.9       35.2        1.3
   32    8.0   75.0       13.2         8.2        12.3         33.7      30.3       35.5        5.2
   33   10.7   74.8       13.0        13.2        12.3         38.5      35.1       36.7        1.6
   34   13.9   72.6       12.8        12.7        11.9         37.4      34.0       26.8       -7.2
   35   11.3   75.3       12.6        13.4        12.3         38.3      34.9       38.0        3.1
   36   11.6   74.1       12.4        13.3        12.2         37.9      34.5       31.7       -2.8
   37   10.4   71.0       12.2        12.9        11.7         36.8      33.4       32.6       -0.8

  Totals                 447.6       445.1       448.7      1,341.4

* Less 20.0 for each functional reading.



standard deviation of $z'''$ is 2.80 bushels, or slightly smaller than the
standard deviation of $z''$, 3.0 bushels. (See Note on page 258.)
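The comparison of the two sets of residuals can be sketched in Python as follows. This is an illustration under stated assumptions, not the original computation: the standard deviation is taken here as the simple (unadjusted) root mean square about the mean, via `statistics.pstdev`, without any adjustment of the kind a later note may discuss; the residuals are those of Tables 57 and 64, with the entry for $X_2 = 11$ in the second set taken as -3.5, the value implied by 19.4 - 22.9.

```python
from statistics import pstdev

# Residuals from the first approximation curves (z''), X2 = 0 ... 37.
z_first = [
    -3.6, 4.6, -2.1, -0.1, 0.0, -0.4, 4.5, -2.1,
    -1.6, 0.1, 1.6, -3.2, 4.9, -2.2, -0.9, 2.6,
    1.8, -2.8, -4.8, -3.2, 1.8, 2.4, 1.1, 3.1,
    -0.9, 2.5, -3.2, 1.6, -0.8, 2.4, 2.1, 1.4,
    5.2, 0.7, -8.2, 2.3, -4.9, -1.6,
]

# Residuals from the second approximation curves (z'''), same order.
z_second = [
    -3.5, 4.4, -1.5, 0.0, -1.8, 0.5, 5.2, -1.6,
    -1.6, -0.1, 0.9, -3.5, 3.9, -2.5, -0.9, 3.3,
    2.8, -2.2, -3.8, -2.3, 2.4, 1.0, 1.7, 2.1,
    -1.7, -0.5, -3.9, 1.4, -2.3, 1.4, 2.9, 1.3,
    5.2, 1.6, -7.2, 3.1, -2.8, -0.8,
]

sd_first = pstdev(z_first)
sd_second = pstdev(z_second)

# Count the observations whose residual shrank or grew in absolute size.
smaller = sum(abs(b) < abs(a) for a, b in zip(z_first, z_second))
larger = sum(abs(b) > abs(a) for a, b in zip(z_first, z_second))

print(f"standard deviation of z''  : {sd_first:.2f}")
print(f"standard deviation of z''' : {sd_second:.2f}")
print(f"residual smaller in {smaller} cases, larger in {larger}")
```

The case counts show why a summary measure is needed: individual residuals move in both directions even when the overall scatter falls.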

Correcting the curves by further successive approximations. The
process ordinarily would be carried through one or more additional
approximations by repeating the steps shown. Thus the last residuals,
$z'''$, when averaged and plotted with respect to the second set of
approximation curves, would indicate whether any further modifications were
needed in the curves; if any were made, new readings would be made
from the new curves, new estimates of $X_1$ obtained from them, and
another set of residuals determined. So long as the standard deviation
of each new set of residuals is smaller than that of the previous set
(and no more complicated curves are drawn in, which would require
more constants to represent them), the approximation curves may
be regarded as approaching closer and closer to the underlying true
curves. When, however, the curves have been determined as closely
as is possible from the given data, the standard deviation of the
residuals will show no further decrease and may even increase slightly.
In such case the set of curves showing the lowest standard deviation
of residuals (and yet conforming to the hypothetical limitations) may
be regarded as the final curves determined by the process.*

* In very exact work, the effect upon the residuals of modifications in each
curve separately might be tested after this point, to insure that each individual
regression curve had been fitted to the data with the greatest degree of accuracy.

We can make a check on the slope and amplitude of the final
curves by the method of least squares, using the supplementary methods
set forth in pages 401 to 403 of Chapter 22. Or if it is desired to have
a mathematical expression of the several curves, equations may be
selected capable of representing the several curves whose shape has been
determined by the graphic successive approximation process, fitting the
mathematical curves according to the methods presented briefly earlier
in this chapter, on pages 221 and 222, and described in more detail
in the first section of Chapter 22.

Stating the final conclusions. After the final shape of the several
net regression curves has been determined, it still remains to state
those curves in such shape that their meaning is perfectly clear. The
several functions may be stated to show the value of the dependent
factor associated with given values of the particular independent
factor when values of other independent factors are held at their mean.
There are two alternative ways of stating the associated values: (1)
as actual values and (2) as deviations from the mean values.





To state the associated values as actual values, we may use the
following procedure:

First, the mean of all the values read from the final curve is
determined. For $f_2(X_2)$, the mean may be designated $M_{f(2)}$. The
values from the curve are read off for selected intervals of $X_2$. Then
the estimated values of $X_1$ for each of these values of $X_2$ (with values
of $X_3$, $X_4$, etc., at their means) are determined by subtracting the mean
of the curve readings from each of these actual readings and adding to
the result the mean of $X_1$. That is, if we use $X'_1 = F_2(X_2)$ to
designate these values of $X_1$, estimated from the net curvilinear relation to
$X_2$, we can define them by the equation

$$X'_1 = F_2(X_2) = f_2(X_2) - M_{f(2)} + M_1 \tag{60}$$

If, however, the expected values of $X_1$ for given values of $X_2$ are
to be stated merely as deviations from the mean values, those
deviations may be determined by subtracting from each curve reading the
mean of all the curve readings. If we use $F_2(x_2)$ to designate these
expected deviations from the mean values, we may define them by
the equation

$$x'_1 = F_2(x_2) = f_2(X_2) - M_{f(2)} \tag{61}$$

It is evident, from equations (60) and (61), that

$$F_2(X_2) = F_2(x_2) + M_1$$

In the actual statement of the results of a correlation study, it is
frequently desirable to state the relation of the dependent factor to the
most important independent factor according to equation (60), and to
state the relation for the remaining independent factors according to
equation (61). When that is done, the estimated values of $X_1$, based
on all the independent factors, may be readily computed by taking the
estimate from the most important factor, and then adding to or
subtracting from that the corrections to take account of the departures of
other factors from their means. Using $X'_1$ to designate this final
estimate of the value $X_1$, and taking $X_3$ as the most important factor, we
make the estimate by the equation

$$X'_1 = F_2(x_2) + F_3(X_3) + F_4(x_4) + \cdots + F_k(x_k) \tag{62}$$

The process of working out these final statements of the net
curvilinear regression lines may be illustrated by the data of the corn-yield
problem. Since the rainfall ($X_3$) was apparently the most important
factor, that may be taken as the one for which the regression is to be
stated according to equation (60). If we regard the second approximation
curve shown in Figure 38 and Table 62 as the final curve, then
Table 64 gives the readings from this curve for each of the individual
observations.

The mean of the readings of $f''_3(X_3)$ is next computed from the
values of Table 64. The sum of the 38 $f''_3(X_3)$ readings is 445.1, so

$$M_{f(3)} = \frac{445.1}{38} = 11.71$$

The mean value of $X_1$ is $M_1 = 31.92$. From equation (60),

$$F_3(X_3) = f''_3(X_3) - M_{f(3)} + M_1$$

which is

$$F_3(X_3) = f''_3(X_3) - 11.71 + 31.92 = f''_3(X_3) + 20.21$$

All that is necessary, therefore, is to add the new constant, 20.2, to
the values read from the curve. This process is shown in Table 65.
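The conversion of curve readings into average yields can be sketched in Python. The names below are illustrative assumptions; the figures are the sum of the rainfall readings (less 20) and the mean of $X_1$ quoted in the text, and a handful of curve readings (less 20) keyed by inches of rainfall.

```python
N = 38
SUM_F3 = 445.1        # total of the rainfall-curve readings, less 20 each
MEAN_X1 = 31.92       # mean of the dependent variable

# Mean of the curve readings, then the constant added to every reading.
mean_f3 = round(SUM_F3 / N, 2)
constant = round(MEAN_X1 - mean_f3, 1)

# Readings (less 20) from the final rainfall curve, by inches of rainfall.
readings = {8: 8.2, 9: 10.5, 10: 12.3, 12: 13.2, 13: 13.0}

# Average yield at each rainfall, other factors held at their means.
average_yield = {inches: round(r + constant, 1) for inches, r in readings.items()}

print(f"mean of readings: {mean_f3}, constant: {constant}")
for inches, bushels in average_yield.items():
    print(f"{inches:2d} inches -> {bushels} bushels")
```

Since the readings were entered less 20, the constant added here already allows for that shift.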

TABLE 65

Computation of Average Yield of Corn With Varying Rainfall, Holding
Trend in Yield and Influence of Temperature Constant

  Inches of rainfall,   Readings from     Constant,         Average yield,
        X_3             final curve,*     M_1 - M_f(3)        F_3(X_3)
                         f''_3(X_3)
          7                  6.0             20.2               26.2
          8                  8.2             20.2               28.4
          9                 10.5             20.2               30.7
         10                 12.3             20.2               32.5
         11                 13.4             20.2               33.6
         12                 13.2             20.2               33.4
         13                 13.0             20.2               33.2
         14                 12.7             20.2               32.9
         15                 12.5             20.2               32.7
         16                 12.3             20.2               32.5

* Curve readings minus 20, just as in Table 64.


The computation for $F_4(x_4)$ follows the same form as that for
$F_3(X_3)$, save that equation (61) is used instead, and hence the mean
of $X_1$ is not involved. First the mean of all the readings for $f''_4(X_4)$,
as shown in Table 64, is computed, giving the value of 11.81. The
values for $F_4(x_4)$ are therefore given by the equation

$$F_4(x_4) = f''_4(X_4) - M_{f(4)} = f''_4(X_4) - 11.81$$

These values are worked out in Table 66.

TABLE 66

Computation of Deviation of Corn Yields from Yields Otherwise Expected,
Because of Differences in Temperature for Season

  Average            Readings from     Constant,     Correction to
  temperature,       final curve,*     -M_f(4)       expected yield,
      X_4             f''_4(X_4)                        F_4(x_4)
     70.0               11.6            -11.8            -0.2
     71.0               11.7            -11.8            -0.1
     72.0               11.8            -11.8             0
     73.0               12.0            -11.8             0.2
     74.0               12.2            -11.8             0.4
     75.0               12.3            -11.8             0.5
     76.0               12.1            -11.8             0.3
     77.0               10.5            -11.8            -1.3
     78.0                8.3            -11.8            -3.5

* Curve readings minus 20, just as entered in Table 64.


The net correction in the estimated yield to allow for the influence
of trend can be obtained by carrying through a similar computation for
$F_2(x_2)$. The readings for $f''_2(X_2)$ sum to 447.6, so $M_{f(2)} = 11.78$. The
values of $F_2(x_2)$ are given by the equation

$$F_2(x_2) = f''_2(X_2) - 11.78$$

This computation is carried out in Table 67.

The conclusions of the study can then be stated as shown in the
last column of each of the last three tables, free from all the previous
details.

The relations for each of the variables can also be combined to
show the expected or estimated yield for various combinations of the
independent factors. Thus for the present case, it might be desired
to combine the findings into a table showing the expected or probable
yield for any given combination of rainfall and temperature, with
the 1927 trend of yield. These values can be obtained by taking the
trend correction for 1927, +0.4, and combining it with the estimated

TABLE 67

Computation of Deviation of Corn Yields from Those Otherwise Expected,
Because of Net Trend in Yields

  Number of year,    Date    Readings from     Constant,     Correction to
       X_2                   final curve,*     -M_f(2)       expected yield,
                              f''_2(X_2)                        F_2(x_2)
         0           1890        7.4            -11.8            -4.4
         5           1895        9.5            -11.8            -2.3
        10           1900       11.0            -11.8            -0.8
        15           1905       12.0            -11.8             0.2
        20           1910       12.7            -11.8             0.9
        25           1915       13.5            -11.8             1.7
        30           1920       13.5            -11.8             1.7
        35           1925       12.6            -11.8             0.8

* Curve readings minus 20.


influence of various quantities of rain and degrees of temperature.
These estimates would then be defined by the equation

$$X'_1 = F_2(x_2) + F_3(X_3) + F_4(x_4) = 0.4 + F_3(X_3) + F_4(x_4)$$

Combining the readings for $F_3(X_3)$ from Table 65 with those for
$F_4(x_4)$ from Table 66, and adding in the correction for $F_2(x_2)$ just
stated, we obtain estimated yields as shown in Table 68.*

* Table 68 may be compared with the results secured by cross-classifying and
averaging the same data, by the methods of Chapter 11.

In preparing a table such as Table 68, we should not enter values for
combinations of the several factors which were not represented in the
data on which the relations were based. Examination of a dot chart
of the relation between rainfall and temperature, for the data included
in the analysis, shows that no combinations of rainfall below 9 inches
and temperature below 74° appeared in the record, and no cases of
temperature above 78° with rainfall above 9 inches occurred.
Accordingly, these combinations, and other combinations which were not
represented, are left blank in the table, as shown. (A more exact
method for measuring the representativeness of the relations is referred
to in Chapter 19, on page 349.)

By combining a table such as Table 68 with a statement of the 
extent to which yields averaged higher or lower than those shown at 
different times through the period, all the conclusions from the study 
can be presented in simple form, easy to understand. 


TABLE 68 

Estimated Yield of Corn, in Bushels per Acre, With Varying Rainfall and 
Temperature Conditions, for 1927 

Inches of                 Average temperature † 
rainfall *       70°       72°       74°       76°       78° 

     7            ‡         ‡       27.0      26.9      23.1 
     9          30.9      31.1      31.5      31.4        ‡ 
    11          33.8      34.0      34.4      34.3        ‡ 
    13          33.4      33.6      34.0        ‡         ‡ 
    15          32.9      33.1        ‡         ‡         ‡ 

* Total for June, July, and August; average for 9 Corn-Belt stations. 
† Average for June, July, and August, at same 9 stations. 
‡ This combination of factors was not represented in the observations analyzed. 
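
A table of this kind can be held as a simple lookup in which the unrepresented combinations are kept explicitly empty, so that no estimate is ever returned for them. The following is only a minimal sketch in Python: the figures are those printed in Table 68, while the names `TABLE_68`, `TEMPERATURES`, and `estimated_yield` are illustrative assumptions and not part of the original study.

```python
# Table 68 as printed: estimated corn yield (bushels per acre) for the
# 1927 trend. None marks combinations of rainfall and temperature that
# were not represented in the observations analyzed.
TEMPERATURES = [70, 72, 74, 76, 78]  # degrees, column headings

TABLE_68 = {  # key: inches of rainfall (row heading)
    7:  [None, None, 27.0, 26.9, 23.1],
    9:  [30.9, 31.1, 31.5, 31.4, None],
    11: [33.8, 34.0, 34.4, 34.3, None],
    13: [33.4, 33.6, 34.0, None, None],
    15: [32.9, 33.1, None, None, None],
}


def estimated_yield(rainfall, temperature):
    """Return the tabled estimate, or None when the combination is
    not tabulated or was not represented in the original data."""
    row = TABLE_68.get(rainfall)
    if row is None or temperature not in TEMPERATURES:
        return None
    return row[TEMPERATURES.index(temperature)]


# Number of combinations for which the table gives an estimate.
n_filled = sum(v is not None for row in TABLE_68.values() for v in row)

print(estimated_yield(11, 74))
print(estimated_yield(7, 70))
print(n_filled)
```

Returning an empty value rather than an extrapolated figure mirrors the rule stated in the text, that entries should not be made for combinations of factors lying outside the data on which the relations were based.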


The final results of curvilinear correlation studies, after being 
simplified to the form shown in Tables 65 to 67, or in Table 68, may 
also be expressed graphically for final publication. Thus all three 
relations might be combined into a single figure, such as shown in 
Figure 40, to present in relatively simple form the final conclusions 
reached by the statistical analysis.¹¹ 

¹¹ A three-dimensional chart illustrating Table 68 is shown on page 37:>. 

It might be noted at this point that Table 68 is much more than 
merely a table of average yields for various rainfall and temperature 
groups. There were only 38 observations to begin with, and only 14 
of those were under 74 degrees temperature. If these 14 observations 
had been grouped according to year and rainfall, and the average 
yield determined for each class, only the roughest sort of groups 
could have been made, and even then the averages would have had but 
little reliability. As the result of the correlation study, however, all 38 
observations have been drawn on to determine the relations. The table 
shows the yield most likely to be received with any of 16 different 
combinations of rainfall and temperature, for the trend in 1927. 
Other estimates could be shown for a large number of other combinations. 
Furthermore, it is known that estimates made from such tables 
agreed with the actual yields to within 2.8 bushels in about two-thirds 
of the original cases. The reliability of these estimated yields is thus 
greater than it would be for any average of a few cases alone. This 
example illustrates the ability of correlation analysis both to bring 
out of a series of observations relations which are not observable 
on the surface and to provide a basis for estimating the probable 
effect on the dependent factor of new combinations of the independent 
factors. 

[Figure 40: charts of the correction to yield, in bushels, for temperature 
(70 to 78 degrees) and for year (1890 to 1930).] 

Fig. 40. Relation of yield of corn to rainfall, temperature, and time. 

In this particular case the final shapes of the regression curves 
showing the net differences in yield with differences in rainfall and 
time are not greatly different from those indicated by simple correlation. 
In some cases, however, the final shape of the curves may be 
markedly different from the apparent shape before the variation 
associated with other factors has been eliminated. Thus the final 
shape of the curve showing the net differences in yield with differences 
in temperature, after allowing for the influence of rainfall and 
time, is quite different from what might have been expected from 
the original observations, as is illustrated in Figure 41. The curvilinear 
net regression is also quite different from the linear net regression, 
indicating that 74 to 76 degrees is the optimum temperature, 
whereas the straight line indicated that the lower the temperature, 
the higher the yield. With multiple correlation, as with simple correlation, 
the determination of the regression curves makes the results 
much more definite, adequate, and usable than does merely the determination 
of the linear regressions. 

[Figure 41: chart of yield of corn against temperature.] 

Fig. 41. Comparison of apparent relation of corn yields to temperature with net 
relation after eliminating influence of rainfall and of trend in yield. 

Limitations on the use of the results. It should be noted that 
the results of the corn-yield analysis apply only to the same area from 
which the data were drawn and to the period which they covered. 
Thus they provide no basis for estimating corn yields in other sections, 
and their use in estimating yields in other periods, as in subsequent 
years, is attended by increasing risk due to the necessity of extrapolating 
the trend regression. Although this may give fair results for a 
year or two, as has been illustrated, it may tend to become increasingly 
inexact. For example, it may be that the trend of yield did not really 
turn downward about 1920, but only flattened out; additional years of 
observations will be needed really to tell which is correct. 

Other multiple curvilinear correlation studies illustrate other 
limitations to the application of the results secured. Thus in a study 
of the price of eggs in New York City, records were secured during 
a period of a few days on the retail sales price of each of a number 
of dozens of eggs, and of the size, color, and quality of the eggs. 
(The data are given in the problem in Chapter 17.) By determining 
the net regression of price upon each of the factors, using the method 
illustrated, the net change in egg prices with changes in each of 
these factors can be determined. But it is readily apparent that size, 
quality, and color are not the only factors which might cause egg 
prices to vary. Prices change from one time of year to another, 
because of changes in seasonal demand, in supplies on the market, 
and in response to other factors as well. Prices also vary from place 
to place on the same day, and even at different stages in the market- 
ing process in the same city on the same day — between sales at whole- 
sale and retail, for instance. When we say that the results of the egg- 
price study enabled us to estimate egg prices to within five cents 
two-thirds of the time, it must be remembered that the statement 
holds true only for the same universe from which the original samples 
were selected. In this case the samples were all selected from sales 
at retail, in the New York metropolitan area, in a particular period. 
The results therefore apply only to the reasons for variations in egg 
prices between particular stores, in that particular city, in that par- 
ticular period. They might indicate the effect of similar differences 
in quality or weight on prices from store to store in the same city at 
other times of the year, or in other cities; but we could not be certain 
of that from this material alone. Other studies, covering those other 
“universes,” would be needed to prove or disprove that supposition; 
for the conclusions, of and by themselves, offer no statistical evidence 
except for their own particular “universe.” For that reason, each of 
the final tables should indicate clearly the conditions to which its 
conclusions apply and thus definitely limit the statistical statement of the 
results to the particular conditions which they really represent. 

A test in actual forecasting of yield. The two preceding paragraphs 
stand exactly as they were written in 1929. Now that this book is 
being revised (in 1941) the regressions based on the period from 
1890 to 1927 can be given a severe test, by using them to estimate 
the yields during the subsequent 12 years. The necessary data for this 
estimate are given in Table 68A. 

Estimates of yield for each of these years, according to the final 
curvilinear regressions shown in Tables 65 to 67 and Figure 40, are 
given in Table 68B, together with the residuals. 

The new years included years of weather conditions more extreme 
than any experienced in the base years. It was, therefore, necessary to 
extrapolate the earlier curves in making the estimates. This was 
done by extending them with the same slope or curve as in the adjacent 
portions of the curve determined from the earlier data. 


TABLE 68A 

Yield of Corn, Rainfall, and Temperatures in Six Leading States, 1928 to 1939 

                        Rainfall       Temperature     Actual yield 
Year        Time        in inches      in degrees      in bushels 
            $X_2$       $X_3$          $X_4$           $X_1$ 

1928         38           15.1            72.8            33.4 
1929         39           10.6            73.4            31.6 
1930         40            6.4            76.4            26.8 
1931         41           10.4            76.9            32.7 
1932         42           13.5            76.0            35.4 
1933         43            7.2            77.3            29.4 
1934         44            7.5            80.0            18.9 
1935         45            9.6            76.2            31.7 
1936         46            4.9            80.0            18.5 
1937         47           10.1            76.6            36.4 
1938         48           12.6            76.3            35.9 
1939         49           11.7            75.8            41.1 

Source: Computed from June, July, and August records for nine weather stations in Corn Belt 
states. Stations averaged include Kansas City, St. Louis, Toledo, Omaha, Peoria, Cincinnati, 
Topeka, Indianapolis, and the Iowa state average, as in the original study. 


It is evident that the regressions gave fairly good estimates for the 
first few years of extrapolation, but thereafter gave increasingly large 
underestimates of the yield. It would appear that the introduction of 
hybrid seed corn, the possible improvement of cultivation with better 
machinery, the increase of soil fertility and the restriction of corn 
to the better fields with acreage-limitation and soil-conservation 
programs after 1933, and other factors, all combined to produce a new 
“universe,” in which the corn yield to be expected for a given combination 
of weather became progressively higher than it had been in 
earlier years. Also, extremes of weather not previously experienced 
(such as the combination of an average temperature of 80° with a 
rainfall of 4.9 inches in 1936), which lay far outside the previous 
observations, apparently produced results somewhat different from 
those in the years analyzed. Even so, the estimates for the years of 
extreme conditions (1934 and 1936) were not extremely in error, as 
contrasted to other years in the last five. The doubts as to the 
correctness of the trend, as expressed in 1929, have been clearly confirmed 
by the subsequent data. 

These actual results of extrapolation of a regression formula indi- 
cate the way that the conditions of a universe may shift and show the 
need of recalculating forecasting formulas for time series every year or 
two, to make sure that they are still applicable. 


TABLE 68B 

Yield Estimated by Curvilinear Regressions on Three Factors, 1928 to 1939 

Year     $F_2(X_2)$   $F_3(X_3)$   $F_4(X_4)$   $X_1''$    $X_1$     $z'' = X_1 - X_1''$ 

1928        0.2          32.7          0.2        33.1      33.4          0.3 
1929        0            33.4          0.3        33.7      31.5         -2.2 
1930       -0.2          24.8         -0.3        24.3      25.8          1.5 
1931       -0.4          33.2         -1.1        31.7      32.7          1.0 
1932       -0.6          33.0          0.3        32.7      35.4          2.7 
1933       -0.8          26.7         -1.0        24.0      29.4          5.4 
1934       -1.0          27.4         -9.0        17.4      18.9          1.5 
1935       -1.2          31.8          0          30.6      31.7          1.1 
1936       -1.4          21.0         -9.0        10.6      18.6          7.9 
1937       -1.6          32.7         -0.5        30.6      36.4          5.8 
1938       -1.8          33.3         -0.1        31.4      35.9          4.5 
1939       -2.0          33.5          0.4        31.9      41.1          9.2 


The residuals for the first six years have a root-mean-square error 
of 2.7 bushels. This compares well with the standard 
deviation of 2.8 for the estimates for the 38 years included in the 
study. The next six years, however, had a root-mean-square error 
of 5.8 bushels. Since these latter errors were all in the same direction, 
the shift in the trend would appear to be primarily responsible 
for this increased unreliability. 
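
The two root-mean-square errors just quoted can be checked directly from the residual column of Table 68B. The short Python sketch below does only that; the residuals are the printed values, and the helper name `rms` is an illustrative assumption.

```python
import math

# Residuals (actual minus estimated yield, in bushels) as printed in
# the last column of Table 68B.
residuals = {
    1928: 0.3, 1929: -2.2, 1930: 1.5, 1931: 1.0, 1932: 2.7, 1933: 5.4,
    1934: 1.5, 1935: 1.1, 1936: 7.9, 1937: 5.8, 1938: 4.5, 1939: 9.2,
}


def rms(values):
    """Root-mean-square of a sequence of residuals."""
    values = list(values)
    return math.sqrt(sum(v * v for v in values) / len(values))


first_six = [residuals[year] for year in range(1928, 1934)]
last_six = [residuals[year] for year in range(1934, 1940)]

rms_first = rms(first_six)
rms_last = rms(last_six)

# True when every residual in the later period is an underestimate.
all_same_direction = all(r > 0 for r in last_six)

print(round(rms_first, 1), round(rms_last, 1), all_same_direction)
```

Splitting the twelve years into two halves, as the text does, separates the period in which the extrapolated trend still held from the period in which it evidently did not.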

(For an exercise in curve fitting by this method, the student can 
fit a set of regressions to the data for the whole period 1890 to 1939. 
Also, it would be valuable to fit separate regressions for the periods 
1890 to 1920, and 1910 to 1940, and compare the two sets of results. 
Do they show a significant change in the relation of yields to the three 
factors?) 

Reliability of Regression Curves 

The regression curves show the net relation between the dependent 
variable and each independent variable, with the net variation asso- 
ciated with the other independent variables held constant, for the 
particular observations included in the sample. If another sample 
were drawn from the same universe, and similar net regression curves 
were determined, they would vary somewhat from the curves deter- 
mined from the first sample. The lower the multiple correlation in the 
universe, or the smaller the sample, the larger would be this variation 
between successive samples. Methods have been developed for estimat- 
ing the proportion of such samples which will give regression results 
falling within given ranges of the true regressions prevailing in the 
universe. (See Chapter 18, pages 327 to 340.) In publishing regres- 
sion results, as shown in Tables 65 to 68, or in presenting charts of the 
regression results, such as shown in Figure 40, the reliability range of 
the regressions should be indicated, as shown subsequently. Even if 
the regressions (as in the example here) are determined from a time 
series, and so are based upon all the evidence for that portion of the 
constantly evolving universe, the reliability limits may still be used 
as an indication of possible significance, in view of the closeness 
with which the relations can be determined. (For a more extended 
discussion of the meaning of sampling errors with respect to time, see 
Chapter 19, pages 349 to 356.) 

Summary. In this chapter methods of determining curvilinear mul- 
tiple regressions have been discussed. These show the extent to which 
changes in the dependent variable are associated with changes in each 
particular independent variable, while simultaneously removing that 
part of the variation in the dependent variable which is associated 
(linearly or curvilinearly) with other independent variables. A 
method of determining the curves by successive graphic approxima- 
tions is presented step by step. Since this method does not involve 
making definite assumptions as to the final shape of the curves, it is 
to be preferred to more mathematical methods, presented in a sub- 
sequent chapter, unless there is a logical basis for the choice of specific 
functions. Methods of simplifying the conclusions for popular state- 
ment are illustrated, and the universe to which they are applicable 
is briefly considered. 

Correction Note. On pages 239 and 247 the standard deviations of the 
residuals, $\sigma_z$, are used to determine whether the new regression curves show any 
gain in closeness of fit over the previous regressions. These comparisons can be 
made most accurately by using the standard errors of estimate, adjusted for n 
and m as explained on pages 208 and 261 (eqs. 42 and 65). The successive 
approximation process should be continued only until the adjusted standard 
error of estimate shows no further reduction. 



CHAPTER 15 


MEASURING ACCURACY OF ESTIMATE AND DEGREE OF 
CORRELATION FOR CURVILINEAR MULTIPLE 
CORRELATION 

In presenting linear multiple correlation it was pointed out that 
coefficients could be computed to show (1) how closely estimated 
values of the dependent variable, based on the linear regression equa- 
tion, could be expected to agree with the actual values; and (2) what 
proportion of the total observed variation in the dependent factor 
could be explained or accounted for by its relation to the independent 
factors considered. These coefficients were, respectively, the stand- 
ard error of estimate and the coefficient of multiple correlation. Ex- 
actly parallel coefficients can be computed to show the significance 
of curvilinear multiple correlation, employing curvilinear net regres- 
sions such as those discussed in Chapter 14. The term “standard 
error of estimate” is again used to indicate the measure of the prob- 
able accuracy of estimated values of the dependent factor. In measur- 
ing the proportion of variation explained we will follow the usage 
in simple curvilinear correlation, and use the term “index” to denote 
the fact that curvilinear regressions have been employed. The propor- 
tion of variation accounted for is therefore shown by the “index of 
multiple correlation.” 

Standard error of estimate. In working through the various 
steps in determining the net regression curves by the method of successive 
approximations, in Chapter 14, the estimated values were subtracted 
from the actual values for each observation, and the resulting 
residual values, $z''$, $z'''$, etc., were obtained. The standard deviations 
of these residuals were used as an indication of the accuracy of estimate 
for each set of curves. Where a very large number of observations is 
employed, such standard deviations of the residuals may be regarded 
as an indication of the extent to which estimated values of the dependent 
variable made from new sets of observed values drawn from the same 
universe may be expected to agree with the actual value of the dependent 
variable. Thus if we use $S_{1.f(2,3,4)}$ to designate the standard error of 
estimates of $X_1$, made on the basis of curvilinear relations to $X_2$, $X_3$, 
and $X_4$, and $z_{1.f(2,3,4)}$ to represent the residuals obtained using the 
final curvilinear regressions to estimate the dependent factor, the 
standard error may be defined by the equation 

$$S_{1.f(2,3,4)} = \sigma_{z_{1.f(2,3,4)}} \tag{63}$$

If the standard error of estimate for the final regression curves for 
the egg-price problem mentioned in the previous chapter were 5 cents, 
that would mean that, if other purchases of eggs had been made in the 
same territory on the same day, it would have been possible to esti- 
mate the price to be paid for each dozen from their physical charac- 
teristics, to an accuracy indicated by that standard error. Two-thirds 
of the estimated values would probably have fallen within a range of 
5 cents of the prices actually charged. 
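
The "two-thirds" in statements of this kind is the share of a normal distribution lying within one standard error of its center. The sketch below, in Python, computes that share; it assumes, as an illustration only, that the errors of estimate are normally distributed, and the function name `normal_share_within` is hypothetical.

```python
import math


def normal_share_within(k):
    """Share of a normal distribution lying within k standard
    deviations of its mean (assumes normally distributed errors)."""
    return math.erf(k / math.sqrt(2.0))


# With a standard error of estimate of 5 cents, a range of 5 cents on
# either side of the estimate is one standard error.
share_one_se = normal_share_within(5.0 / 5.0)

print(round(share_one_se, 3))
```

Under that assumption the share is a little over 68 per cent, which is the figure loosely described as two-thirds.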

With the corn-yield problem, the standard deviation of the residuals 
from the last set of curves was 2.8 bushels. In this case no other 
“sample” can be drawn from the same “universe” except those included 
in the problem, for the universe was restricted to the years studied, 
1890 to 1927. Extrapolating the trend line, however, it is fairly safe 
to say that estimates made for the same region for subsequent years 
can be expected to have a standard deviation of at least 2.8 bushels. 
If the trend used did not prove correct for subsequent years, the errors 
might be considerably larger.¹ 

¹ This statement, written a decade ago, may be compared with the actual 
extrapolations made subsequently, as shown on pages 255 to 257 of the previous chapter. 

The relation shown in equation (63) holds exactly true only where 
there are a very large number of cases included in the sample dealt 
with. Where the sample is no larger than is usually available to the 
research worker, there is a tendency for the standard deviation of z 
to be somewhat smaller than the standard error which would be found 
in a very large sample drawn from the same universe. The smaller 
the number of observations, the larger the number of independent 
variables included, and the more complex the curves employed, the 
greater will be the tendency for the observed standard deviation 
to underestimate the true standard error. This may be illustrated 
by results from an experimental study of the stability of multiple 
curvilinear correlation results. In this case a universe of known 
correlation was employed, and successive samples were drawn of 
various sizes, repeating each drawing a number of times for the samples 
of each size. The curvilinear regressions were then determined for 
each sample separately by the successive approximation method, and 
the standard deviations were worked out for the residuals in each case. 
The entire analysis was then repeated, employing a universe of a 
higher correlation. The central values of these standard deviations 
of the residuals, for the samples of each size, were: 


Number of                Observed standard deviation of z * 
observations             Universe 1          Universe 2 

30                          1.95                1.53 
50                          2.18                1.64 
100                         2.21                1.72 
Entire universe             2.40                1.80 

* These values are the median values observed. 


It is quite evident from these results that the samples tended to 
give standard deviations smaller than that which actually was true 
for the universe as a whole and, further, that the smaller the sample 
employed, the greater the overestimate of the reliability of the esti- 
mated values. 


It is therefore necessary to adjust the observed $\sigma_z$ to give $\bar{S}_{1.f(2,3,\text{etc.})}$, 
which is an unbiased estimate of $S_{1.f(2,3,\text{etc.})}$ for the universe from which 
the sample was drawn. This adjustment is given in the following equation: 

$$\bar{S}^2_{1.f(2,3,\text{etc.})} = \frac{\sigma_z^2}{1 - m/n} \tag{64}$$

or 

$$\bar{S}^2_{1.f(2,3,4,\text{etc.})} = \frac{n\,\sigma_z^2}{n - m} = \frac{\sum z^2}{n - m} \tag{65}$$

where n = number of observations in the sample, 
and m = number of constants represented (either mathematically 
or graphically) in the regression equation. 

It will be seen that equation (65) is exactly similar to equation (42) 
for the standard error of estimate in linear multiple correlation 
problems. For curvilinear problems, however, the value m has a somewhat 
different meaning. Thus in the experimental results just discussed, 
three independent factors were involved, so the regression equation 
was of the form 

$$X_1 = a + f_2(X_2) + f_3(X_3) + f_4(X_4)$$

The corresponding linear regression equation would involve only 
four constants, so m would be equal to 4. For curvilinear regressions, 
however, at least two constants would be necessary to represent each 
regression curve, and possibly more. In the experimental study each 
curve had only one bend, either upward or downward. It was 
judged, however, that the curves could not be represented by 
second-order parabolas, since their shapes did not follow the smooth 
symmetrical curve which that type of function is capable of describing. 
Instead, it was judged that a third-order parabola would be necessary 
to give a fairly satisfactory fit to each regression curve. The 
conclusion was therefore reached that three constants would be necessary 
for a mathematical representation of each regression curve. On 
that basis the entire regression equation would represent approximately 
ten constants, three for each of the three curves, and one for 
the value a. (See pages 76 to 81 for other types of curves.) 

Using 10 for m in equation (64), we may work out the value of 
$\bar{S}_{1.f(234)}$ for the smallest sample shown in the statement on page 261 as 
follows: 

$$\bar{S}^2_{1.f(234)} = \frac{(1.95)^2}{1 - 10/30} = \frac{3.80}{0.667} = 5.70$$

$$\bar{S}_{1.f(234)} = 2.39$$

It is evident that this corrected value is much closer to the true 
value for the entire universe, 2.40, than was the original standard 
deviation of z. 

Carrying the same adjustment through for the other values shown 
on page 261, we obtain standard errors of estimate as shown in the 
following statement. 

                                   Universe 1                 Universe 2 
Number of          Value      Observed    Calculated     Observed    Calculated 
observations,      used       $\sigma_z$  $\bar{S}$      $\sigma_z$  $\bar{S}$ 
n                  for m 

30                  10          1.95        2.39           1.53        1.87 
50                  10          2.18        2.43           1.64        1.83 
100                 10          2.21        2.33           1.72        1.82 
Entire universe                 2.40                       1.80 


The superior accuracy of the adjusted values is evident through- 
out this table — in each case they come much nearer to agreeing with 
the true value for the universe than do the unadjusted values. 
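
The adjustment of equation (64) is easily carried through by machine. The sketch below, in Python, applies it to the observed standard deviations quoted above with m taken as 10; the function name `adjusted_standard_error` is an illustrative assumption, and the last digit of a result may differ slightly from a hand computation that rounds at intermediate steps.

```python
import math


def adjusted_standard_error(sigma_z, n, m):
    """Equation (64): adjust the observed standard deviation of the
    residuals for n observations and m constants in the regression."""
    if n <= m:
        raise ValueError("n must exceed m")
    return math.sqrt(sigma_z ** 2 / (1.0 - m / n))


# Observed median standard deviations of z: n -> (Universe 1, Universe 2)
observed = {30: (1.95, 1.53), 50: (2.18, 1.64), 100: (2.21, 1.72)}
m = 10  # constants judged to be represented by the three curves

adjusted = {
    n: tuple(adjusted_standard_error(s, n, m) for s in pair)
    for n, pair in observed.items()
}

for n, (u1, u2) in adjusted.items():
    print(n, round(u1, 2), round(u2, 2))
```

The adjusted values for all three sample sizes lie close to the universe values of 2.40 and 1.80, whereas the unadjusted values fall increasingly short as the sample becomes smaller.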

Using equation (64) to obtain the standard error of estimate for the 
corn-yield problem, we find it necessary first to decide on the value 
to use for m. That problem also employed three independent variables, 
just as did the experimental study, and the final $\sigma_z$ was 2.80 bushels. 
Although none of the three regression curves has more than one bend, 
none of them is of the symmetrical shape that can be described by the 
parabola; instead, at least a cubic parabola would be required to 
represent the curves for $f_3(X_3)$ and $f_4(X_4)$, whereas probably a 
quartic parabola, involving four constants, would be required to represent 
$f_2(X_2)$ with its final shape, or three constants with its first form. 
The final regression equation for corn yields might therefore be 
assumed to represent one constant for a, four for $f_2(X_2)$, three for 
$f_3(X_3)$, and three for $f_4(X_4)$, or a total of eleven in all. When this 
value and the number of cases are inserted in formula (64), it becomes 

$$\bar{S}^2_{1.f(234)} = \frac{\sigma_z^2}{1 - m/n} = \frac{(2.80)^2}{1 - 11/38} = 11.03$$

$$\bar{S}_{1.f(234)} = 3.32$$


Although the standard deviation of the observed residuals was only 
2.8 bushels, this standard error of estimate indicates that, in using 
the results in making estimates for other years, the accuracy is likely 
to be less, even though the trend line is correctly extended. Instead 
of the estimated values probably coming within 2.8 bushels of the 
actual values in 68 per cent of the cases, they are likely to come so 
close in only about 58 per cent of the estimates, and an error of 3.3 
bushels would have to be allowed to take in 68 per cent of the cases. 
In this particular problem, with 3 regression curves determined 
from 38 observations, the correction embodied in equation (64) is 
important. If the same set of conclusions had been obtained from 20 
observations, with the same standard deviation of the residuals, applying 
the correction formula would have increased the standard error 
of estimate to above 4.1 bushels, illustrating again the tendency of a 
small sample to exaggerate the accuracy of estimate.² 
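
The effect of sample size on the correction can be seen by repeating the corn-yield computation for 38 and for 20 observations. This is a minimal Python sketch of equation (64) with the figures given in the text (observed standard deviation 2.80 bushels, eleven constants); the function name is an illustrative assumption.

```python
import math


def adjusted_standard_error(sigma_z, n, m):
    """Equation (64): standard error of estimate adjusted for the
    number of observations n and the number of constants m."""
    return math.sqrt(sigma_z ** 2 / (1.0 - m / n))


sigma_z = 2.80  # observed standard deviation of residuals, bushels
m = 11          # one for a, four for f2, three each for f3 and f4

se_38 = adjusted_standard_error(sigma_z, 38, m)  # the actual study
se_20 = adjusted_standard_error(sigma_z, 20, m)  # hypothetical smaller sample

print(round(se_38, 2), round(se_20, 2))
```

With the same residuals, cutting the sample from 38 to 20 observations raises the adjusted standard error from about 3.3 to more than 4.1 bushels.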


² As is indicated later (Chapter 19, pages 341 to 347), each individual estimate 
for a new observation has its own standard error. Those standard errors are all 
larger than the standard error of estimate from the sample. The interpretation 
given above for the use of the standard error of estimate therefore understates the 
standard error for new observations. 


Index of multiple correlation. The coefficient of multiple correlation, 
it will be remembered, indicated the proportion of the total 
variation in the dependent factor which could be accounted for on 
the basis of the linear relations to the several independent factors. In 
exactly the same way the proportion of variation which can be 
accounted for on the basis of the curvilinear relations to the several 
independent factors is termed the “index of multiple correlation,” 
and is designated by the term P, that is, capital rho. Following the 
definition, and using $X_1''$ to indicate values of $X_1$ estimated from the 
other factors on the basis of the net curvilinear regressions, we may 
define the index of multiple correlation roughly by the equation 

$$P = \frac{\sigma_{X_1''}}{\sigma_{X_1}}$$

It is more accurately computed, however, by making use of the 
standard deviations of the residuals. Using $\sigma_z$ to represent $\sigma_{z_{1.f(2,3,4)}}$, 
then 

$$P^2 = 1 - \frac{\sigma_z^2}{\sigma_{X_1}^2} \tag{66.1}$$

With small samples $\sigma_z$ tends to be smaller than the actual standard 
error of estimate in the universe as a whole. For that reason, the index 
of correlation, as computed by the formula just given, tends to exceed 
the correlation that actually obtains in the universe from which the 
observations are drawn. Data from the experiment mentioned earlier 
illustrate this point. The following tabulation shows the modal index 
of multiple correlation for the samples of each size, in comparison with 
the true index of correlation for the entire universe. 


Number of observations        Observed index of multiple correlation 
in sample                     in samples drawn from same universe 

30                                          0.77 
50                                          0.71 
100                                         0.68 
Entire universe                             0.62 


In every case the observed correlation exceeds the true correlation 
in the universe, and the smaller the size of the sample, the larger the 
difference. It is therefore necessary to apply to the index of multiple 
correlation the same type of adjustment which was applied in obtaining 
the standard error of estimate, if unbiased estimates of the population 
value are to be obtained. This may be done either by substituting 
the adjusted standard error of estimate for the observed standard 
deviation of the residuals in the equation to determine P, or by making 
the adjustment directly in the equation itself. The following formulas 
show both methods. 

$$\bar{P}^2_{1.234} = 1 - \frac{\bar{S}^2_{1.f(234)}}{\sigma_1^2}\left(\frac{n-1}{n}\right) \tag{66.2}$$

$$\bar{P}^2_{1.234} = 1 - \left[\left(1 - P^2_{1.234}\right)\left(\frac{n-1}{n-m}\right)\right] \tag{66.3}$$


The adjusted indexes of multiple correlation work out for the 
experimental data as shown in the following statement: 


Number of              Value used       Crude,      Adjusted, 
observations, n        for m            $P$         $\bar{P}$ 

30                        10            0.77          0.64 
50                        10            0.71          0.63 
100                       10            0.68          0.65 
Entire universe                         0.62 



Here again the adjusted values are found to be in much better agreement 
with the true value for the entire universe than are the crude 
values. For that reason equations (66.2) or (66.3) should always be 
employed in calculating the index of multiple correlation. 
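
The direct adjustment of equation (66.3) may be sketched in Python as follows, using the crude indexes quoted above with m taken as 10. The function name `adjusted_index` is an illustrative assumption, and the guard against a negative quantity under the root is an added safeguard rather than part of the formula. Because the crude indexes are themselves rounded to two places, a recomputed value may differ by one unit in the second decimal from a printed one.

```python
import math


def adjusted_index(p_crude, n, m):
    """Equation (66.3): index of multiple correlation adjusted for
    n observations and m constants in the regression curves."""
    adjusted_sq = 1.0 - (1.0 - p_crude ** 2) * (n - 1) / (n - m)
    # A negative value means no correlation is indicated after adjustment.
    return math.sqrt(max(adjusted_sq, 0.0))


crude = {30: 0.77, 50: 0.71, 100: 0.68}  # n -> crude index P
m = 10

adjusted = {n: adjusted_index(p, n, m) for n, p in crude.items()}

for n, p_bar in adjusted.items():
    print(n, round(p_bar, 2))
```

The smaller the sample, the more of the crude index is removed by the adjustment, which is why the three adjusted values come out nearly alike although the crude values differ widely.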

Unless the index of multiple correlation, as calculated with the 
adjustment, is larger than the coefficient of multiple correlation, with 
its comparable adjustment by equation (47), there is no statistical 
evidence of significant curvilinearity in the regression lines. Unless 
the standard error for the curves is lower even after adjustment, any 
reduction in the unadjusted standard deviation of $z_{1.f(2,3,4)}$, as 
compared with $\sigma_z$ from the linear regression, would be merely a fictitious 
improvement in accuracy. If we take additional variables into account, 
or use up more degrees of freedom by employing more constants in 
the curves, we obtain a certain amount of spurious increase in the 
apparent correlation. Correcting for n and m removes this spurious 
effect. 

Once the index of multiple correlation has been computed by 
equations (66.2) or (66.3), the square of its value may be employed to 
represent the total determination, i.e., to measure the proportion of the 
total variance in $X_1$ which can be accounted for on the basis of the 
curvilinear relations to the several independent factors. To maintain 
the same terminology, this may be termed the index of total determina- 
tion, to distinguish it from the coefficient of total determination, which 
applies to linear multiple correlation. 

The computation of the index of multiple correlation may now be
illustrated from the data of the corn-yield problem.³ In that study
the original standard deviation of the yields was 4.30 bushels, the
standard error of estimate by linear multiple correlation, 3.87 bushels,
and the coefficient of multiple correlation, after adjusting for the
number of cases, 0.49. The standard error of estimate for the final
regression curves, as worked out on page 263, was 3.32 bushels.
Computing the index of multiple correlation by equation (66.2), we have

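$$\bar{P}^2_{1.234} = 1 - \frac{\bar{S}^2_{1.f(234)}}{\sigma_1^2}\left(\frac{n-1}{n}\right) = 1 - \frac{(3.32)^2}{(4.30)^2}\left(\frac{37}{38}\right) = 0.4184$$

$$\bar{P}_{1.234} = 0.65$$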

The index of multiple correlation is therefore 0.65, as compared
with the coefficient of multiple correlation of 0.49. The total
determination, which was 24 per cent for the linear relation, has been
raised to 42 per cent for the curvilinear. The increase indicates that
the linear relations did not express all the effect of the three
independent variables, and that taking the curvilinearity of the
regressions into account has added significantly to the importance of
the factors considered. With such a low determination, however, it is
evident that there are other perhaps more important factors not yet
taken into account.
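The arithmetic of this example can be retraced with a few lines of Python. This is only a sketch: it assumes that equation (66.2) has the form 1 - (S²/σ²)((n - 1)/n) used in the worked example, and the function name is hypothetical. With the rounded inputs quoted in the text (3.32, 4.30, and 38 cases) the squared index comes out near 0.420 rather than the printed 0.4184, a difference that presumably reflects rounding of the standard error; the index itself still rounds to 0.65.

```python
import math

def adjusted_index_squared(s_est: float, sigma_y: float, n: int) -> float:
    """Squared index of multiple correlation, assuming the form
    1 - (S^2 / sigma^2) * ((n - 1) / n) shown in the worked example."""
    return 1.0 - (s_est ** 2 / sigma_y ** 2) * ((n - 1) / n)

# Corn-yield figures quoted in the text (rounded to two decimals there).
p_squared = adjusted_index_squared(s_est=3.32, sigma_y=4.30, n=38)
p_index = math.sqrt(p_squared)

print(f"index of total determination: {p_squared:.4f}")
print(f"index of multiple correlation: {p_index:.2f}")
```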

Measuring the net curvilinear importance of individual factors. 

No method has been devised as yet to determine the portion of the
index of total determination which can be ascribed to each of the
several independent factors, solely from the methods used in obtaining
the several regression curves themselves. The final slope and
shape of the curves may be tested, however, by correlating the curve
readings for each observation with the original values of the
dependent factor, so as to obtain the partial regression coefficients
indicated in equation (89), and explained in Chapter 22.

3 See pages 225 and 227.

$$X_1 = a' + b_{12'.3'4'}[f_2(X_2)] + b_{13'.2'4'}[f_3(X_3)] + b_{14'.2'3'}[f_4(X_4)]$$

If that is done, the coefficient of multiple correlation, $R_{1.2'3'4'}$,
measures the total correlation with respect to the several curvilinear
functions (including the final adjustments) and is therefore the index
of multiple correlation, $P_{1.234}$. It is, however, still subject to
the same adjustment for number of constants as are indexes of multiple
correlation computed in other ways, and should therefore be corrected
as follows:

$$\bar{P}^2_{1.234} = 1 - \left(1 - R^2_{1.2'3'4'}\right)\left(\frac{n-1}{n-m}\right) \tag{67}$$
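A correction of this kind is easy to apply mechanically. The sketch below assumes that the adjustment multiplies the unexplained proportion by (n - 1)/(n - m), where m is the number of constants used up in fitting; the function name and the illustrative figures (a raw squared correlation of 0.50, 38 cases, 10 constants) are hypothetical and are not taken from the corn-yield study.

```python
def adjust_for_constants(r_squared: float, n: int, m: int) -> float:
    """Shrink a squared multiple correlation for n observations and m
    fitted constants, assuming 1 - (1 - R^2) * (n - 1) / (n - m)."""
    if n <= m:
        raise ValueError("need more observations than fitted constants")
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - m)

# Hypothetical illustration: the more constants the curves use up,
# the larger the share of the apparent correlation that is removed.
raw = 0.50
for constants in (1, 4, 10):
    print(constants, round(adjust_for_constants(raw, n=38, m=constants), 4))
```

With a single constant the correction leaves the value unchanged, which is a convenient check that the formula has been entered correctly.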

Indexes of partial correlation can be determined with respect to 
the curvilinear regressions of the several independent variables, as 
shown in equation (89), in exactly the same way that the parallel 
coefficients of partial correlation are obtained. Since the curvilinear 
transformation relates solely to the net regression of Xi on each of 
the independent variables, the meaning of the partial indexes with 
respect to the separate variables is open to some doubt. 

Summary. For curvilinear multiple regression equations it is
possible to obtain standard errors of estimate, indexes of multiple
correlation, and indexes of partial correlation, which serve the same
purpose that the comparable coefficients serve for linear multiple
regressions. Owing to the extent to which the process of fitting the
curves may exaggerate the significance of the results, it is even more
important to adjust the several measures with respect to the number
of observations and numbers of constants involved than it is with
linear multiple correlation.



CHAPTER 16 


SHORT-CUT METHODS OF DETERMINING NET REGRESSION 
LINES AND CURVES 

In problems where the correlation is fairly high, the number of 
variables is not too large, and the number of observations is rela- 
tively small (say not over 50 to 100 cases), net regression lines and 
curves may be determined by a combination of inspection and graphic 
approximation which takes only a fraction of the time required by 
the methods previously presented in detail.¹ This graphic method
is very speedy, and in the hands of a careful worker can yield results 
almost as accurate as those obtained by the longer methods previously 
set forth. It must be used, however, with the same regard to the mean- 
ing of correlation results, to the care in selection of material, and to 
the consistency of results with those logically expected as the other 
methods. It is subject to even more severe limitations with respect 
to the sampling variability of the results obtained from successive 
samples than are the other methods. For these reasons the student 
should first become thoroughly acquainted with the preceding methods, 
and their meaning and limitations, and then use this method only as 
a more rapid procedure for obtaining substantially the same results. 

The general basis of the short-cut method is to select, by inspection,
several individual observations for which the values of one or more
independent variables are constant, and then note the change in the
dependent variable for given changes in the remaining independent
variable. This process is repeated for additional groups of
observations for which the other independent variable or variables are
constant (or practically so) but at a different level than for the first
group. The relation between the dependent variable and the remaining
independent variable, as indicated by a series of such groups,
approaches the net regression line or curve, since the cases have been
selected so as largely to cancel out the variation associated with other
independent variables. A first approximation line or curve is then
drawn in by eye, and the residuals from this curve, measured
graphically, are used to determine the regression for the next variable,
cases again being selected so as to eliminate the influence of other
independent variables. The final fit of the several lines or curves is
tested by the same successive approximation process employed in
Chapters 10 and 14, or by a shorter graphic equivalent of it. Since the
initial lines or curves approach much more closely to the final net
regressions, and since graphic transfers of residuals are substituted for
curve reading and computation of the residuals, the process is much
shorter and fewer steps are required.

1 L. H. Bean, Applications of a simplified method of graphic curvilinear
correlation, mimeographed preliminary report, U. S. Bureau of
Agricultural Economics, April, 1929; and A simplified method of graphic
curvilinear correlation, Journal of the American Statistical
Association, Vol. XXIV, pp. 386-397, December, 1929.

Linear net regressions. The short-cut method for linear regres- 
sions may be illustrated by the same farm-income problem utilized 
in Chapters 10, 11, and 12. The first step is to number each one of 
the observations as listed in the first four columns of Table 47, page 
199, so that they may be distinguished from one another. 

Preliminary examination of inter-relationships. The next step is
to make dot charts of the intercorrelations of the independent variables,
to see how they are related. Since there are three independent
variables, $X_2$, $X_3$, and $X_4$, there are three sets of such
intercorrelations: $X_2$ with $X_3$, $X_2$ with $X_4$, and $X_3$ with
$X_4$. Dot charts for these combinations are shown in Figure 42. In
entering these charts, we identify each observation by its own number,
for future reference.

Examination of Figure 42 shows a moderate negative correlation
between cows and acres and men and acres, and a slight positive
correlation between cows and men. If charts such as these showed
practically perfect correlation between any two independent variables
(all the dots clustering closely together along a line or curve), that
would be a warning that those two variables were so closely
inter-related that it would be difficult or impossible to untangle the
separate effects of each, regardless of what method was used. In such a
case, one of the independent variables should be dropped, and the
regressions found for the other variable should be stated as the
relation of the dependent variable to the values of the independent
variable retained and the associated values of the independent variable
which was excluded. In this case, the intercorrelations are all low
enough so that it will not be difficult to separate out the effects of
each one.²
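The visual check on the dot charts can be supplemented with a numerical one. The sketch below computes the simple correlation between each pair of independent variables for the 20 farms, using the acres, cows, and men columns as transcribed from Table 69. The cut-off of 0.9 for "practically perfect" correlation is a hypothetical threshold chosen for illustration only; the text itself judges the charts by eye, and the printed coefficients are offered only as a cross-check on that reading.

```python
import math

# (acres X2, cows X3, men X4) for farms 1 to 20, transcribed from Table 69.
FARMS = [
    (60, 18, 2), (220, 0, 3), (180, 14, 4), (80, 6, 1), (120, 1, 1),
    (100, 9, 1), (170, 6, 3), (110, 12, 2), (160, 7, 2), (230, 2, 3),
    (70, 17, 2), (120, 15, 3), (240, 7, 4), (160, 0, 2), (90, 12, 1),
    (110, 16, 3), (220, 2, 2), (110, 6, 1), (160, 12, 2), (80, 15, 2),
]

def pearson(x, y):
    """Simple (gross) correlation coefficient between two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

acres, cows, men = zip(*FARMS)
intercorrelations = {
    "acres-cows": pearson(acres, cows),
    "acres-men": pearson(acres, men),
    "cows-men": pearson(cows, men),
}

WARN_ABOVE = 0.9  # hypothetical cut-off, not a value given in the text
for pair, r in intercorrelations.items():
    verdict = "too closely related" if abs(r) > WARN_ABOVE else "separable"
    print(f"{pair}: r = {r:+.2f} ({verdict})")
```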

2 Intercorrelation among the independent variables that is high but not
perfect reduces the speed with which the successive approximations
converge toward the best values, those which would be found by least
squares. In such cases many more approximations may be required to get
the best simultaneous fit.



270 GRAPHIC METHOD FOR MULTIPLE CURVILINEAR REGRESSION 


The next step is to chart the values of the four variables for each
observation in succession and connect them by lines just as if they were
entries in a time series, as shown in Figure 43. (Classifying the records
in order with respect to one of the independent factors before taking
this step might be advisable.)


Fig. 42. Dot charts showing the intercorrelations of the independent
variables, $X_2$, $X_3$, and $X_4$.


Comparing the different lines in Figure 43, we see that variation
in incomes appears to be more closely associated with variations in
cows than with either of the other factors. (Dot charts of $X_1$ with
$X_2$, $X_1$ with $X_3$, and $X_1$ with $X_4$ might be used instead to
reach this conclusion.) The relation of $X_1$ to $X_3$, number of cows,
for constant numbers of acres and men, will therefore be examined first.







Determination of first approximation regression lines. From Figure
42 we note that of the farms with the largest numbers of acres, both
farms 2 and 10 have 3 men employed, whereas farms 13 and 17 have
4 and 2, respectively. Accordingly we plot the cows and incomes
for these farms on a new dot chart as shown in Figure 44, indicating
the number of the farm represented by each dot, and using solid dots.
The placing of these dots does not seem to indicate any marked
relation of income to the number of men; we therefore draw in a straight
line freehand, to fit approximately the change in income with changes
in cows, as shown by these four observations. (The values may be
taken from Table 69, page 277.)

Fig. 43. Acres, cows, men, and income, on 20 farms.

Turning to the small farms, on the $X_2 X_4$ section of Figure 42,
we note that farms 6, 15, and 18, each with between 90 and 110 acres, 
have 1 man apiece; and farms 8, 11, and 20, with 70 to 110 acres, 
have 2 men apiece. Plotting the corresponding observations as hollow 
dots on Figure 44, again we have little evidence of any influence of the 
differences in number of men. The other small farms, 4, 5, and 16, 
are accordingly plotted, and a line, estimated graphically to pass 
through the nine observations as well as possible, is drawn in as shown. 




Finally, it is noted that farms 7, 9, 14, and 19 all have 160 to 
170 acres, so these are plotted on Figure 44 as crosses, to distinguish 
them. The differences in the number of men are ignored at this step, 
since they have been found to have little apparent relation to the 
income in the previous cases, and a line is drawn through these last 
cases, as indicated. 

Comparing the three lines, we see that all have about the same
slope, so a single line is drawn in to pass through the intersection of
the averages of cows and of income, with a slope averaging the slope of
the other three lines. This last line is the first approximation to the
net regression of income on cows, with acres and men constant. The
dots for the remaining farms, numbers 1, 3, and 12, are then plotted
in, with the numbers to indicate their identity.

Fig. 44. Income plotted against cows, on specified farms, and first
approximation to net linear regression on cows.
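The grouping described above can be imitated numerically. In the sketch below the three groups of farms are exactly those named in the text, but each freehand line is replaced by a least-squares line fitted within the group; that substitution is an assumption made for illustration, since the text fits the lines by eye. The farm records are transcribed from Table 69, and all function and variable names are hypothetical.

```python
# farm number: (acres X2, cows X3, men X4, income X1), transcribed from Table 69.
FARMS = {
    1: (60, 18, 2, 960), 2: (220, 0, 3, 830), 3: (180, 14, 4, 1260),
    4: (80, 6, 1, 610), 5: (120, 1, 1, 590), 6: (100, 9, 1, 900),
    7: (170, 6, 3, 820), 8: (110, 12, 2, 880), 9: (160, 7, 2, 860),
    10: (230, 2, 3, 760), 11: (70, 17, 2, 1020), 12: (120, 15, 3, 1080),
    13: (240, 7, 4, 960), 14: (160, 0, 2, 700), 15: (90, 12, 1, 800),
    16: (110, 16, 3, 1130), 17: (220, 2, 2, 760), 18: (110, 6, 1, 740),
    19: (160, 12, 2, 980), 20: (80, 15, 2, 800),
}

# Groups of farms of roughly constant acreage, as selected in the text.
GROUPS = {
    "large": [2, 10, 13, 17],
    "small": [4, 5, 6, 8, 11, 15, 16, 18, 20],
    "medium": [7, 9, 14, 19],
}

def ls_slope(xs, ys):
    """Least-squares slope of y on x (stand-in for a line drawn by eye)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sum((x - mx) ** 2 for x in xs)

# Slope of income on cows within each acreage group, then their average.
group_slopes = {
    name: ls_slope([FARMS[f][1] for f in ids], [FARMS[f][3] for f in ids])
    for name, ids in GROUPS.items()
}
avg_slope = sum(group_slopes.values()) / len(group_slopes)

# The first approximation passes through the averages of cows and income.
mean_cows = sum(v[1] for v in FARMS.values()) / len(FARMS)
mean_income = sum(v[3] for v in FARMS.values()) / len(FARMS)

def first_approximation(cows):
    """Income expected from cows alone on the first approximation line."""
    return mean_income + avg_slope * (cows - mean_cows)

for name, slope in group_slopes.items():
    print(f"{name:6s} farms: {slope:5.1f} dollars per cow")
print(f"average slope: {avg_slope:5.1f} dollars per cow")
```

The three group slopes come out similar to one another, which is the numerical counterpart of the observation that the three freehand lines "all have about the same slope".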

For the next step, a blank chart is prepared, as shown in Figure
45, to show the relation between acres, $X_2$, and the departures of
income from that expected on the basis of the approximate regression
on number of cows. This chart is completed by scaling off
the vertical departure of each observation in Figure 44 from the
approximation line, and then plotting that departure in Figure 45
as a departure from the zero line, with the number of acres for the
same observation as abscissa.³ The identity of the observation
represented by each dot is again shown by its number. Here, to aid in
identifying observations according to the other independent variable,
solid dots have been used for farms with 1 man, circled dots for farms
with 2, and crosses for farms with 3. The 2 farms with 4 men are also
shown as solid dots. The relation of acres to income is now clearly
evident (in fact, were this not a discussion of linear correlation, fitting
a curve would seem to be justified). It is next noted that farms 4, 5,
6, 15, and 18 have but 1 man apiece. Accordingly a line is dotted in
to pass as near the dots for these farms as possible. Farms 2, 7, 10,
12, and 16 have 3 men each, so a line is fitted to them graphically,
as indicated. Farms 1, 8, 9, 11, 14, 17, 19, and 20 have 2 men each,
so they are designated by enclosing each of them with a circle, and
a line is fitted freehand to them. All these lines are of somewhat
the same slope, so a final line is drawn in by eye, averaging the slope
of the other lines and intersecting the zero line at the abscissa
corresponding to the average number of acres. This line is the first
approximation to the regression of income on acres determined while
holding constant the approximate effects of both cows and number of
men.

3 For a convenient and speedy method of scaling off and transferring
these departures graphically, see pages 479 to 485.

Fig. 45. Income adjusted for cows (by first approximate regression),
plotted against acres on specified farms, and first approximation to
net linear regression on acres.

The next step is to prepare a chart for number of men and
adjusted income, as shown in Figure 46. The deviations of the
individual observations from the approximate regression line in Figure
45 are measured graphically, and plotted in as deviations from the
zero line in Figure 46, with the number of men for each observation
as abscissa. The placing of these dots indicates a tendency for
income to increase with number of men. The average adjusted income
for each number of men is determined by inspection, and indicated
by the small circles. Then a straight line is fitted by eye so as to
intersect the zero line at the average of $X_4$, and fit these averages
as well as possible.

Determination of second approximation net regression lines. The
next step is to check the slope of the previous approximate net
regression lines, to see if any changes are needed, now that the effect
of other factors has been more accurately allowed for. To do this, the
line from Figure 44 is drawn in on Figure 47. The deviations of each
of the observations in Figure 46 are then scaled off graphically, and
plotted in Figure 47 as vertical deviations from the line, with the
number of cows, $X_3$, as abscissa. The plotting of these deviations
indicates that a slightly steeper line might fit better, since it is found
that, although in the range 0 to 2 cows, 2 dots fall below the line
whereas 3 fall above, in the range 14 to 18, 4 out of 6 dots fall above
the line, and in the range 6 to 8 cows, 3 of the 5 observations fall below
the line. Accordingly a revised line is drawn in freehand, passing
through the intersection of the averages of cows and income as before,
and fitting the new dots as well as possible. The first line for the
regression of income on acres is then checked in the same manner, by
plotting the deviations from the new line in Figure 47 as deviations
from the first approximate regression on acres (Figure 45). This
process, carried out by graphic plotting just as before, is shown in
Figure 48.

Fig. 46. Income adjusted for cows and acres (by first approximate
regressions), plotted against number of men on specified farms, and
first approximation to net linear regression on men.

Fig. 47. Income adjusted for acres and men (by first approximate
regressions), plotted against cows, and first and second approximations
to net regressions on cows.

Fig. 48. Income adjusted for cows (by second approximate regression)
and men (by first approximation), plotted against acres.

The distribution of the dots in Figure 48 shows that the observations
are so nearly evenly balanced about the line now that no further change
in the line is necessary. It is evident that a curve would fit better than
the straight line, but for the present we are considering linear relations
only.

Fig. 49. Income adjusted for cows and acres (by second approximate
regressions), plotted against men.

Since no change has been made in the regression for $X_2$, all that
remains is to check the first line for the regression on $X_4$, using the
deviations from the line in either Figure 47 or in Figure 48. Plotting
these deviations graphically as before, above or below a line with the
same slope as in Figure 46, gives the result shown in Figure 49. Since
this figure shows no significant change from Figure 46, the line is left
unchanged, and the lines on Figures 47, 48, and 49 are accepted as
giving the approximate values for $b_{13.24}$, $b_{12.34}$, and
$b_{14.23}$, respectively. If the increases in income per unit change
are calculated from these lines they come out 29.2 dollars per cow,
1.34 dollars per acre, and 52.7 dollars per man, as contrasted to the
exact values of 26.3, 1.21, and 50.3, worked out in Chapter 12. Although
the values are not identical, they are quite close; so close, probably,
that the differences between them have no statistical significance in
view of the small number of observations on which they are based. (If a
larger number of successive approximations were used, and the average
residuals were computed at each step as a guide to the new lines, the
final values would come even closer to the exact values.)
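The convergence of successive approximations toward the exact values can be demonstrated numerically. The sketch below is a computational analogue of the graphic procedure, not the procedure itself: at each step the residuals left by the other two lines are carried over to the third variable, and a least-squares slope takes the place of a line drawn by eye. The farm records are transcribed from Table 69; the number of sweeps and all names are illustrative assumptions.

```python
# (acres X2, cows X3, men X4, income X1) for farms 1 to 20, from Table 69.
FARMS = [
    (60, 18, 2, 960), (220, 0, 3, 830), (180, 14, 4, 1260), (80, 6, 1, 610),
    (120, 1, 1, 590), (100, 9, 1, 900), (170, 6, 3, 820), (110, 12, 2, 880),
    (160, 7, 2, 860), (230, 2, 3, 760), (70, 17, 2, 1020), (120, 15, 3, 1080),
    (240, 7, 4, 960), (160, 0, 2, 700), (90, 12, 1, 800), (110, 16, 3, 1130),
    (220, 2, 2, 760), (110, 6, 1, 740), (160, 12, 2, 980), (80, 15, 2, 800),
]

def centered(values):
    """Express values as departures from their own average."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

columns = list(zip(*FARMS))
x = [centered(col) for col in columns[:3]]   # acres, cows, men
y = centered(columns[3])                     # income
n_obs = len(y)

b = [0.0, 0.0, 0.0]          # slopes on acres, cows, men; start from flat lines
for sweep in range(200):     # each sweep revises every line once
    for j in range(3):
        # Income adjusted for the current lines of the other two variables.
        adjusted = [
            y[i] - sum(b[k] * x[k][i] for k in range(3) if k != j)
            for i in range(n_obs)
        ]
        # Refit the line for variable j to the adjusted incomes.
        b[j] = (sum(xj * a for xj, a in zip(x[j], adjusted))
                / sum(xj * xj for xj in x[j]))

b_acres, b_cows, b_men = b
print(f"per acre {b_acres:.2f}, per cow {b_cows:.1f}, per man {b_men:.1f}")
```

Run to convergence, the slopes settle at the least-squares values quoted in the text (about 1.21 dollars per acre, 26.3 per cow, and 50.3 per man), which illustrates why further graphic approximations would move the freehand lines toward those figures.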

Estimating values of dependent variable. The estimated income
may now be worked out for each farm, either by taking readings
directly from each curve or by substituting the approximate values
found for the regression coefficients in equation (39) to determine a,
and then working out the estimates mathematically. In either case,
the correlation and standard error could be computed only by working
out the estimated values, calculating the residuals and their standard
deviation, and substituting those in equations (42) and (48). The
process of computing the estimates by using values read directly from
the figures is shown in Table 69.
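The second route, working out the estimates mathematically, can be sketched as follows. The sketch assumes that equation (39) determines the constant a in the usual way, by making the regression plane pass through the averages of all the variables; that form is an assumption here, since the equation is not reproduced in this section. The slopes are the graphic values quoted in the text, and the farm records are transcribed from Table 69.

```python
# (acres X2, cows X3, men X4, income X1) for farms 1 to 20, from Table 69.
FARMS = [
    (60, 18, 2, 960), (220, 0, 3, 830), (180, 14, 4, 1260), (80, 6, 1, 610),
    (120, 1, 1, 590), (100, 9, 1, 900), (170, 6, 3, 820), (110, 12, 2, 880),
    (160, 7, 2, 860), (230, 2, 3, 760), (70, 17, 2, 1020), (120, 15, 3, 1080),
    (240, 7, 4, 960), (160, 0, 2, 700), (90, 12, 1, 800), (110, 16, 3, 1130),
    (220, 2, 2, 760), (110, 6, 1, 740), (160, 12, 2, 980), (80, 15, 2, 800),
]

# Slopes read from the final graphic lines, as quoted in the text.
B_ACRES, B_COWS, B_MEN = 1.34, 29.2, 52.7

n_obs = len(FARMS)
mean_acres, mean_cows, mean_men, mean_income = (
    sum(col) / n_obs for col in zip(*FARMS)
)

# Assumed form of equation (39): the plane passes through the averages.
a = mean_income - B_ACRES * mean_acres - B_COWS * mean_cows - B_MEN * mean_men

estimates = [a + B_ACRES * ac + B_COWS * co + B_MEN * me
             for ac, co, me, _ in FARMS]
residuals = [income - est
             for (_, _, _, income), est in zip(FARMS, estimates)]

print(f"a = {a:.1f}")
for number, (est, z) in enumerate(zip(estimates, residuals), start=1):
    print(f"farm {number:2d}: estimated {est:7.1f}, residual {z:+7.1f}")
```

The estimates obtained this way differ by a few dollars from those read off the charts in Table 69, as would be expected when slopes rounded to three figures replace graphic readings.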

Calculating standard error of estimate and multiple correlation.
The standard deviation of the z's computed in Table 69 is 68.06.
By substituting this value in equations (42) and (48), the standard
error of estimate and the multiple correlation work out as follows:


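$$\bar{S}^2_{1.234} = \frac{n\sigma_z^2}{n-m} = \frac{20(4{,}632)}{16} = 5{,}790$$

$$\bar{S}_{1.234} = 76.09$$

$$\bar{R}^2_{1.234} = 1 - \frac{\bar{S}^2_{1.234}}{\sigma_1^2}\left(\frac{n-1}{n}\right) = 1 - \frac{5{,}790}{27{,}276}\left(\frac{19}{20}\right) = 0.798$$

$$\bar{R}_{1.234} = 0.893$$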

The new standard error of $76.09 compares with that of $74.65
obtained by the regular least-squares method, and the multiple
correlation of 0.893 by the approximation method compares with the value
0.898 obtained by the more exact method. As indicated by these
slightly lower coefficients, the approximation method is not quite so
precise, yet for most practical purposes the results are nearly the
same.⁴
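These two measures can be recomputed directly from the residuals. The sketch below takes the z column as transcribed from Table 69 and the variance of income, 27,276, from the computation in the text; it assumes that equations (42) and (48) have the forms used in that computation, with n = 20 observations and m = 4 constants. The residuals as tabulated give a standard deviation of about 68.07.

```python
import math

# Residuals z (actual minus estimated income) for farms 1 to 20, from Table 69.
Z = [-57, 66, 88, -37, 37, 138, -54, -34, 24, -75,
     14, 23, -88, 70, -36, 47, -9, 52, -2, -160]

N_OBS = 20             # number of farms
N_CONSTANTS = 4        # a and three regression coefficients
VAR_INCOME = 27276.0   # variance of income, as used in the text

mean_z = sum(Z) / N_OBS
var_z = sum((z - mean_z) ** 2 for z in Z) / N_OBS

# Assumed form of equation (42): adjusted standard error of estimate.
s_squared = N_OBS * var_z / (N_OBS - N_CONSTANTS)
# Assumed form of equation (48): adjusted multiple correlation.
r_squared = 1.0 - (s_squared / VAR_INCOME) * ((N_OBS - 1) / N_OBS)

print(f"standard deviation of z: {math.sqrt(var_z):.2f}")
print(f"standard error of estimate: {math.sqrt(s_squared):.2f}")
print(f"multiple correlation: {math.sqrt(r_squared):.3f}")
```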

TABLE 69

Calculation of Estimated Income from Linear Regressions Determined by
Approximation Method

Number  Acres  Cows  Men  Income  f2(X2)  f3(X3)  f4(X4)    X1'      z
          X2    X3    X4     X1

   1       60    18    2     960   -106   1,134    -11    1,017    -57
   2      220     0    3     830   +110     612    +42      764     66
   3      180    14    4   1,260   + 56   1,022    +94    1,172     88
   4       80     6    1     610   - 80     789    -62      647    -37
   5      120     1    1     590   - 26     641    -62      553     37
   6      100     9    1     900   - 52     876    -62      762    138
   7      170     6    3     820   + 43     789    +42      874    -54
   8      110    12    2     880   - 39     964    -11      914    -34
   9      160     7    2     860   + 29     818    -11      836     24
  10      230     2    3     760   +123     670    +42      835    -75
  11       70    17    2   1,020   - 93   1,110    -11    1,006     14
  12      120    15    3   1,080   - 26   1,051    +42    1,057     23
  13      240     7    4     960   +136     818    +94    1,048    -88
  14      160     0    2     700   + 29     612    -11      630     70
  15       90    12    1     800   - 66     964    -62      836    -36
  16      110    16    3   1,130   - 39   1,080    +42    1,083     47
  17      220     2    2     760   +110     670    -11      769    - 9
  18      110     6    1     740   - 39     789    -62      688     52
  19      160    12    2     980   + 29     964    -11      982    - 2
  20       80    15    2     800   - 80   1,051    -11      960   -160

The short-cut method applied to curvilinear regressions. The
greatest usefulness of the short-cut method is in determining net
curvilinear regressions. Since the method of successive graphic
approximations presented in Chapter 14 also depends on the convergence
of successive approximate curves, the short-cut method secures results
which are exactly as reliable, at a great saving of time.

4 In fact, the differences between the values obtained by exact solution
and those obtained by the approximation method are no larger than might
readily occur by chance if the mathematical analysis were repeated on a
second sample of the same size, to judge from the standard errors of the
three regression coefficients, when computed by the methods explained in
Chapter 18.




The procedure will be illustrated by a problem of four variables. 
The same method may be applied to larger or smaller problems equally 
well. 

The data to be considered are: 

TABLE 69A

Data for Short-Cut Method of Determining Regression Curves*

         Cost per ton of      Proportion of         Average hourly
Year     finished steel       capacity operated     earnings
         X1                   X2                    X3
         (dollars per ton)    (per cent)            (cents per hour)

1920          72.3                 88.3                  77.6
1921          78.5                 47.5                  60.2
1922          57.9                 71.3                  68.6
1923          63.0                 88.3                  67.0
1924          63.7                 69.0                  70.8
1925          62.9                 78.4                  70.3
1926          60.3                 88.0                  70.8
1927          59.6                 78.9                  71.3
1928          55.2                 83.4                  71.8
1929          51.6                 89.2                  72.5
1930          58.6                 65.6                  73.2
1931          65.6                 38.0                  70.8
1932          81.4                 18.3                  61.0
1933          65.0                 28.7                  59.0
1934          64.6                 31.2                  70.0
1935          65.4                 38.8                  73.0
1936          61.1                 59.3                  74.0
1937          65.6                 71.2                  86.0

* The data are calculated from regular published reports of the U. S.
Steel Corporation. See Kathryn H. Wylie and Mordecai Ezekiel, The cost
curve for steel production, Journal of Political Economy, Vol. XLVIII,
pp. 777-821, December, 1940.


Data for 1938 and 1939 are also available, but we shall disregard 
them until the analysis is completed, and then use them for checking 
the results. 

Logical relation of the variables. These data are from a study of
the relation of volume of steel output to cost per ton. The qualitative
examination of the problem (see discussion in publication cited in the
footnote to Table 69A) indicated that changes in wage rates might be
expected to have a relative, or multiplying, effect upon the cost for a
given output, so that the relation might best be examined in terms of:

$$\log X_1 = f_2(X_2) + f_3(X_3)$$
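Why a relation stated in logarithms corresponds to a multiplying effect can be seen by taking antilogarithms. The short derivation below is a sketch that assumes common (base 10) logarithms and a relation with one function of output, $f_2(X_2)$, and one function of wages, $f_3(X_3)$; the symbol $\Delta$ for a change in the wage term is introduced here for illustration only.

```latex
\log X_1 = f_2(X_2) + f_3(X_3)
\quad\Longrightarrow\quad
X_1 = 10^{f_2(X_2)} \cdot 10^{f_3(X_3)}

% A change \Delta in the wage term multiplies the cost by the same factor
% at every rate of operation:
\frac{10^{f_2(X_2)} \cdot 10^{f_3(X_3) + \Delta}}
     {10^{f_2(X_2)} \cdot 10^{f_3(X_3)}} = 10^{\Delta}
```

Under an additive relation in absolute values, by contrast, the same change in wages would add a fixed number of dollars per ton whatever the rate of operation.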

Also, the qualitative examination revealed that major changes in
technical methods of production, especially the beginning of the
substitution of continuous-strip mills for hand mills, had taken place
during the period under consideration, and that these improvements
in technology might need to be included, either directly as a
labor-efficiency factor or, indirectly, as a trend factor.

To simplify this illustrative presentation, the data will be used
in absolute values, instead of using the logarithms. The charts will be
examined for indications of multiplying relationship, however, since
(as is shown in detail on page 296) this graphic method can also be used
to spot the presence of such non-additive relations.

Conditions on the curves to be drawn. Before proceeding to the
statistical steps in the examination of these data, the types of curves
logically expected and the resulting conditions to be placed upon the
shapes of the curves to be obtained must also be considered. Without
going into the underlying technical reasons (presented more fully in
the original study), let us assume that the following conditions will be
imposed:

On the net relation of cost to capacity:

1. The curve may fall, at a declining rate, until a minimum is
reached, and may then increase gradually after that minimum is
passed. No points of inflection are expected.

On the net relation of cost to wages:

2. The curve will rise steadily, possibly at an increasing rate with
higher wages, but otherwise will be fairly uniform; that is, it will be
either a straight line or a shallow curve, concave from above. There
should be no inflections.

On the net relation of cost to the time elements (efficiency, etc.):

3. The curve will tend to decline, perhaps slowly at first and then
more and more rapidly as new techniques are introduced. There might
also be irregular changes reflecting the changes in general price level
(and in various purchased materials and services other than labor)
during the period under examination, especially in the early 1920's and
after 1929. (Note how this trend factor lumps together labor efficiency,
price levels, and perhaps other factors, each of which might be given
separate consideration in a more elaborate investigation.)
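Conditions of this kind can be checked mechanically once a curve has been drawn and read off at equally spaced values of the independent variable. The sketch below is one hypothetical way of doing so: a change of sign in the second differences of the readings is taken as evidence of a point of inflection. The function names and the two sets of readings are invented for illustration and are not taken from the steel study.

```python
def second_differences(readings):
    """Second differences of curve readings taken at equally spaced abscissas."""
    return [readings[i + 1] - 2 * readings[i] + readings[i - 1]
            for i in range(1, len(readings) - 1)]

def has_inflection(readings, tol=1e-9):
    """True if the curvature changes sign somewhere along the readings."""
    signs = [1 if d > 0 else -1
             for d in second_differences(readings) if abs(d) > tol]
    return any(a != b for a, b in zip(signs, signs[1:]))

def rises_steadily(readings):
    """True if every reading is higher than the one before it."""
    return all(b > a for a, b in zip(readings, readings[1:]))

# Hypothetical readings: a cost curve falling at a declining rate to a
# minimum and then rising gently (the shape allowed by condition 1).
cost_vs_capacity = [80.0, 70.0, 63.0, 59.0, 57.0, 56.5, 57.0]
# Hypothetical S-shaped readings, which condition 2 would rule out.
s_shaped = [0.0, 1.0, 4.0, 9.0, 12.0, 13.0]

print(has_inflection(cost_vs_capacity))
print(has_inflection(s_shaped), rises_steadily(s_shaped))
```

A freehand curve that fails such a check would be redrawn before its readings were used in the next approximation.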





Preliminary examination of inter-relationships among the
independent variables. As before, the inter-relationships of the several
independent variables (including time for the trend factor) must be
examined before the short-cut approximations can be begun. These
are presented in Figure 50, the years being used to designate the
observations. After the dots were located, the successive years were
connected by a light line, making it possible to consider the relations
of $X_4$ (time) to $X_3$ and $X_2$, as well as of $X_2$ to $X_3$, all on
this one chart. (This same method could be used even in non-time-series
data by first classifying the data on the ascending values of one
independent variable. Successive observations, by number, would then
indicate increasing values for that variable.)



Fig. 50. Wages and per cent of capacity operated, with successive
observations connected to indicate shift in the $X_2 X_3$ relationship
with time.

Examining first the location of the dots in Figure 50, without
regard to their sequence, a moderate intercorrelation between wages
($X_3$) and rate of operations ($X_2$) is evident. No low values of
$X_2$ are found, except together with low values of $X_3$. In the higher
ranges of $X_2$ the values of $X_3$ fan out more, varying from quite low
to quite high. Apparently there is enough independence in the occurrence
of the two variables to permit of fairly good separation of their
effects.

When examined with regard to time, however, the independence is
not so good. The low wages at high output all occurred in one period:
1921 to 1923. The marked positive correlation of wages and operations
from 1930 to 1937 is also a correlation with time, both generally
declining from 1930 to 1933, and both rising from 1933 to 1937. Since
this was the period when technological changes were greatest, it may
be difficult to disentangle the time or trend elements here, reflecting
these technological changes, from the effects of the associated advances
in output and in wages. We shall have to be on guard for this as we
proceed with the analysis.

Looking for groups of observations which hold the other factor
constant, we note on Figure 50 that there were a considerable number
of years when wages⁵ fell between 70 and 75 cents per hour. The
observations for these years may be used to hold wages substantially
constant, while the data are examined for the apparent effects of
operation rate and time.

Determination of first approximation curve for first independent
variable. The observations for the years with wages of 70 to 75 cents
are accordingly plotted on Figure 51 with percentage capacity operated
($X_2$) as the abscissa and cost per ton ($X_1$) as the ordinate.⁶ After
the dots are plotted, successive observations (when they occur in this
group) are connected by light dotted lines. This enables us to examine
the relation of cost to operation rate and time while holding wages
constant.

These observations indicate at once a marked negative correlation 
between operation rate and cost. The data from 1924 to 1929 suggest 
a rapid fall in cost for a given rate, especially from 1927 to 1929. Ap- 
parently there was some further decline from 1931 to 1934, but the 
data for 1935 to 1936 fall almost precisely on those for 1930 to 1931. 
(However, examination of Figure 50 shows that wages were slightly 
higher in this latter period, which might obscure the trend factor at 
this point.) No curve is indicated as yet. Accordingly, a line is drawn 
in lightly, as indicated, to show the relation of cost to operation rate 
for these observations, with the trend factor also considered.⁷
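The selection of years with nearly constant wages, and the first reading of the relation between cost and operation rate within that group, can be sketched as follows. The figures are transcribed from Table 69A; the row printed between 1924 and the second 1926 is taken to be 1925, which is an assumption. The correlation coefficient is used here only as a numerical stand-in for the visual impression of the dot chart, and the names are hypothetical.

```python
import math

# year: (cost per ton X1, per cent of capacity X2, hourly earnings X3),
# transcribed from Table 69A (1925 is an assumed reading of a misprinted year).
STEEL = {
    1920: (72.3, 88.3, 77.6), 1921: (78.5, 47.5, 60.2), 1922: (57.9, 71.3, 68.6),
    1923: (63.0, 88.3, 67.0), 1924: (63.7, 69.0, 70.8), 1925: (62.9, 78.4, 70.3),
    1926: (60.3, 88.0, 70.8), 1927: (59.6, 78.9, 71.3), 1928: (55.2, 83.4, 71.8),
    1929: (51.6, 89.2, 72.5), 1930: (58.6, 65.6, 73.2), 1931: (65.6, 38.0, 70.8),
    1932: (81.4, 18.3, 61.0), 1933: (65.0, 28.7, 59.0), 1934: (64.6, 31.2, 70.0),
    1935: (65.4, 38.8, 73.0), 1936: (61.1, 59.3, 74.0), 1937: (65.6, 71.2, 86.0),
}

# Years in which wages stayed between 70 and 75 cents per hour.
selected = [year for year, (_, _, wage) in sorted(STEEL.items())
            if 70.0 <= wage <= 75.0]

def pearson(x, y):
    """Simple correlation coefficient between two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

capacity = [STEEL[year][1] for year in selected]
cost = [STEEL[year][0] for year in selected]
r_cost_capacity = pearson(capacity, cost)

print("years with wages of 70 to 75 cents:", selected)
print(f"correlation of cost with operation rate: {r_cost_capacity:+.2f}")
```

Eleven of the eighteen years fall in the selected group, and within it cost and operation rate are negatively related, in line with the reading of Figure 51 given in the text.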

5 "Wage rates per hour" is quite a different thing from "average
earnings per hour employed," since the latter is a weighted figure
reflecting all changes in the composition of the labor force. The latter
is the figure used here (note Table 69A), since an average wage-rate
figure was not available. For brevity, however, the term "wages" will be
used here to describe the data, even though that is not the technically
correct designation.

6 Great care should be exercised in plotting these values, as their
exact location becomes the basis for all the successive graphic
transfers. Chart paper of adequate size to separate the dots should be
used.

7 By drawing this line parallel to the lines connecting successive
years, all trend is eliminated except the one-year change. If the line
were tilted slightly steeper than the line connecting successive years,
that would provide an approximate correction for the year-to-year
change, also. With the uncertainty of trend effects after 1931, however,
that was not done here, but was left for subsequent approximations to
clarify.




The observations for years of very low wage rates — 1921, 1922, 
1932, and 1933 — are next plotted, and consecutive years again con- 
nected by dotted lines. Both show exaggerated drops in costs with 
increases in output. Only 1933 shows a cost lower than might be 
expected from the observations previously plotted. If 1932 were also 
to show a cost below the usual relation, the regression curve would 
have to swing up sharply, so as to pass above it. The high value for 

Fig. 51. Cost per ton (X1) and per cent of capacity operated (X2), and first
approximation to f2'(X2).

1921 may be ignored for the moment, as possibly reflecting the high 
price levels at the end of the first World War inflation. 

The two years of high wages — 1920 and 1937 — and the one remain- 
ing year of moderately low wages, 1923, are next plotted. The dot 
for 1937 falls above the other observations, and that for 1920 much 
higher still, apparently confirming the unusual (trend?) factors affect- 
ing the position of the 1921 observation. Similarly 1923 is fairly high, 
despite its moderate wage rate, as compared to subsequent years. 

The evidence as to wage rates, to this point, sums up as follows: 
1920 to 1923 all show relatively high costs (with the exception of 





FIRST APPROXIMATION CURVES 


283 


1922). Apparently trend elements outweighed the effects (if any) 
of the low wages in 1921 and 1923. With low rates, 1933 shows quite a 
low cost for the low rate of output, whereas 1932, with somewhat 
higher wage rate, shows a much higher cost. Apparently the fall in 
output to near zero increases cost very greatly per unit. On the 
basis of these considerations, a curve could be drawn in as the first 
approximation, extending the previous line but bending it up to pass 
well above 1932, with its low wage rate. With only one or two observa- 
tions to support that bend at this stage, it seems best to be more 
conservative until the other factors have been more definitely allowed 
for, and until the evidence for a curve (if any) is more clearly estab- 
lished (even though a curve of declining costs was expected.) 

Accordingly the straight line previously drawn in lightly is ex- 
tended and used as the first approximation toward the net regression, 
f2'(X2). (If a curve had been clearly indicated by the examination of
the data as described above, it would have been drawn in at this point, 
thus starting the successive approximations from a curve instead of 
from a straight line.) 

Determination of first approximation curve for second independent 
variable. The next step is to examine the relation of costs, as now 
approximately corrected for the relation to operation rate by f2'(X2),
to wages and time. Accordingly, the vertical departures of the dots
on Figure 51 from the line of f2'(X2) are scaled off, and are plotted in
Figure 52.⁸ The departures are plotted as ordinates, with the values
of X3, wages, as abscissas. If the fourth variable, X4, were not a time
series, or not arranged in order, it would be necessary to group these
observations according to its value, also, as was done in plotting
Figure 51. Since the numbers of the successive years indicate the
successive values of X4, that is not necessary. After the dots are all
plotted, the successive years are connected by a light dotted line, to 
aid in separating the trend influences from that of wages. 

If the dotted line to the successive years is followed, it is apparent 
that there was a general downward trend in the adjusted costs. The 
years 1920 and 1921 appear on one level, the years 1922 to 1927 on a 
lower level, and the years from 1928 on (with the exception of 1932) 
on a still lower level. In each of these groups of years there is a 
positive relation between adjusted costs and wages, as indicated by the 
light lines drawn through each group. Only the last group has any 

⁸ As with the linear short-cut method, the job of making these readings and
transfers can be made swifter and more accurate by using the technique outlined
on pages 479 to 485.




indication of a curve. Even there, the curve depends entirely on the 
position of the two extreme observations, one at each end. Here, how- 
ever, the lower portion of this curve parallels, almost exactly, the lines 
indicating the apparent positions for the two other groups, which in 
turn lie mainly on the left half of the lower group of observations. 
Furthermore, the shape of the curve — shallowly concave — is consistent 
with that logically expected. Accordingly, a shallow curve passing 
through the center of the observations is drawn in, approximately 
paralleling the apparent lines and curve representing the relations for 
the three groups. The succeeding successive approximations will show 


Fig. 52. Wages and cost per ton adjusted to average operation rate on the basis
of the first approximation, and first approximation to f3'(X3).

whether this curve is justified or whether a straight line should be 
substituted. 

Determination of first approximation curve for third independent 
variable. The next step is to examine the relation of costs, now ap- 
proximately adjusted for both wages and operation rate, to time. Ac- 
cordingly, the vertical departures of the dots on Figure 52 from the 
curve f3'(X3) are scaled off, and are plotted in Figure 53. Again the
departures are plotted as ordinates, with this time the values of X4
as abscissas. Since this is the last independent variable to be con- 
sidered, it is not necessary to group the observations with respect to 
any other variable but all can be plotted and examined as a whole. 







Figure 53 shows the resulting chart. Connecting the successive years 
makes it easier to study the type of trend present.⁹

Except for the single wide departure in 1932, Figure 53 indicates 
a definite downward trend from the beginning, tapering off about 1930 
and running flat or gradually rising thereafter. Taking midpoints 
between each pair of observations (indicated by the crosses) helps to 
locate the approximate level of this trend. The one extreme departure, 
1932, is disregarded in the process. Its position in Figure 51, at the 

Fig. 53. Time and cost per ton adjusted to average operation rate and wages,
on the basis of the first approximation curves, and first approximation to f4'(X4).
extreme end of the line, meant that its adjustment for X2 was in doubt.
A smooth curve is then drawn in, declining to about 1930, and running
flat thereafter. The rising trend indicated by the observations for 1936
and 1937 is left for subsequent approximations to confirm. In general
it is unwise to give an extra “twist” to a regression curve simply on
the evidence of one or two observations. 
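The midpoint device mentioned above can be sketched in a few lines. This is only an illustration: the function name `pair_midpoints` and the sample values are hypothetical, and it assumes that "each pair of observations" means each pair of successive years.

```python
def pair_midpoints(years, adjusted_costs):
    """Return (year, cost) midpoints between each pair of successive observations.

    The midpoints damp the year-to-year zigzag and so help locate the
    approximate level of the trend before a smooth curve is drawn.
    """
    return [
        ((years[i] + years[i + 1]) / 2, (adjusted_costs[i] + adjusted_costs[i + 1]) / 2)
        for i in range(len(years) - 1)
    ]


# Hypothetical adjusted costs (not values read from Figure 53).
example = pair_midpoints([1920, 1921, 1922], [10.0, 8.0, 6.0])
```

An extreme observation such as 1932 would simply be left out of the input lists before the midpoints are taken.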

⁹ If joint functions are suspected (see Chapter 21) the data might again be
grouped for values of X2 and X3, in plotting Figure 53. If these groups showed
varying relations to X4, even after the approximate relations to X2 and X3 had
now been eliminated, that would indicate the presence of a joint relation. Note
Figure 57, and the discussion on pages 296 to 299 of this chapter.





Determination of second approximation curve for first independent 
variable. We now have determined first approximation lines or curves 
to the net regressions of X1 on X2, X3, and X4. The departures of
the dots on Figure 53 from the regression line f4'(X4) are the residuals,
z", from this first set of curves. The remaining steps involve the
graphic transfer of these residuals to each curve in turn, the correc- 



Fig. 54. Per cent of capacity operated, and cost per ton unadjusted and adjusted
to average values of other variables, and second and third approximations to f2(X2).

tion of each curve on the basis of the fit of the new residuals, and in 
turn the transfer of the newly corrected residuals to the next curve, 
and so on until no further change is indicated in any of the curves. 
Ordinarily the residuals from Figure 53 would be plotted back on the 
original curve for X2, Figure 51. To show the process clearly, however,
the dots and the first approximation curve for f2'(X2), from Figure 51,
are reproduced again as Figure 54. 
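The cycle just described (transfer the residuals to each curve in turn, correct that curve, pass on to the next) can be imitated numerically. The sketch below is an analogue under stated assumptions, not the graphic procedure itself: the freehand curve is replaced by simple group averages within bins of fixed width, the constant is taken as the mean of the dependent variable, and the names `group_average_curve` and `successive_approximation` are hypothetical.

```python
from statistics import mean


def group_average_curve(x, y, width):
    """Stand-in for a freehand curve: the mean of y within each bin of x."""
    bins = {}
    for xi, yi in zip(x, y):
        bins.setdefault(int(xi // width), []).append(yi)
    table = {b: mean(v) for b, v in bins.items()}
    return lambda xi: table[int(xi // width)]


def successive_approximation(y, xs, widths, rounds=3):
    """Refit each curve in turn to y adjusted for all the other curves."""
    n = len(y)
    a = mean(y)
    curves = [lambda v: 0.0 for _ in xs]  # start from no relation at all
    for _ in range(rounds):
        for j in range(len(xs)):
            # Adjust y for the constant and for every curve except curve j.
            adjusted = [
                y[i] - a - sum(curves[k](xs[k][i]) for k in range(len(xs)) if k != j)
                for i in range(n)
            ]
            curves[j] = group_average_curve(xs[j], adjusted, widths[j])
    residuals = [y[i] - a - sum(c(x[i]) for c, x in zip(curves, xs)) for i in range(n)]
    return a, curves, residuals


# Hypothetical, exactly additive data on a balanced grid.
x2 = [10, 10, 20, 20, 30, 30]
x3 = [1, 2, 1, 2, 1, 2]
y = [80 - u + 5 * v for u, v in zip(x2, x3)]
a, curves, residuals = successive_approximation(y, [x2, x3], widths=[10, 1])
```

With balanced, exactly additive data the residuals vanish after the first round; with intercorrelated independent variables more rounds would be needed, which parallels the behavior of the graphic process.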




SECOND APPROXIMATION CURVES 


287 


The vertical departures of the dots on Figure 53 from the approximation
curve, f4'(X4), are then plotted on Figure 54 as departures
above and below the regression line, f2'(X2), with the corresponding
values of X2 as abscissas. To prevent confusion with the original values
shown as solid dots, the corrected values are indicated as hollow dots. 

It is at once apparent, on inspection of Figure 54, after the cor- 
rected values are all plotted in, that the new values show much less 
scatter than the original values. Closer inspection reveals that every- 
one of the adjusted observations below 60 per cent of capacity falls 
above the first approximation line, with a single exception. In the 

Fig. 55. Wages, and cost per ton adjusted to average values of all other
variables, and second and third approximations to f3(X3).

range from 60 per cent to 80 per cent, three cases fall below the first 
approximation line (two widely) and three slightly above, indicating 
in this range that the new line should be lower than before. The five 
observations above 80 per cent fall two below, two about the same 
distance above, and one right on the line, indicating that the position of 
the line here is about correct. These departures confirm the sugges- 
tion previously given by the 1932 value in Figure 51 that the regression 
should be a curve, concave from above. This accords, also, with the 
logical conditions originally imposed on this relation. Accordingly 
such a curve is drawn in freehand, passing as near as possible through 
the averages of the adjusted values in each successive group. (To 
facilitate drawing the curve, the average of the residuals in successive 









ranges of 10 to 15 units of X2 are estimated graphically and drawn in
as hollow squares.) 

Determination of second approximation curve for second independent
variable. The vertical departures of the adjusted values
(the hollow dots) above or below the second approximation curve,
f2''(X2), are next scaled off graphically and plotted as ordinates from
the values of the f3'(X3) curve, as zero, with the corresponding X3
values as abscissas. This is generally done on the original X1X3 chart
(Figure 52). For clarity, however, the curve of Figure 52 is here
reproduced on Figure 55, and the departures from Figure 54 are transferred
to this new chart. The four observations around 60 for X3
average definitely below the line; both the next group up to 72.5 and
the next group 72.5 up to 75 average slightly below, whereas the single
observation above 85 falls above the line. These averages are indicated
by squares on Figure 55.¹⁰ The single high observation at the
end alone would not be enough to indicate a change in the curve, but 
it is consistent with the group averages, which indicate the need for 
a slightly steeper curve than the original one. Accordingly this new 
curve is drawn in, approximately through the group averages, but 
still conforming to the conditions stated on page 279. To this point 
none of the relations, as indicated by the data, has differed sufficiently 
from the shapes logically expected to require any reconsideration of 
the logical analysis from which the conditions limiting the shapes to 
be drawn were derived. 

Determination of second approximation curve for third independent 
variable. The same process is used in determining the second approxi- 
mation for the next variable. The vertical departures of the dots on 
Figure 55 above or below the second approximation curve, f3''(X3),
shown as a dashed line, are scaled off and plotted as departures from
the f4'(X4) curve, with the corresponding X4 values as abscissas. Again
a new chart is prepared, Figure 56, with f4'(X4) reproduced, although
the original chart, Figure 53 (on page 285), is still clear enough so that
these new values could readily have been plotted upon it. Again, 
as the observations are equally spaced in time, a continuous light line 
is drawn in, connecting the successive observations. 

If the curve were any ordinary function — anything except a trend 
allowance for a number of unrepresented factors — there would be little 
evidence, from the dots in Figure 56, for any further change in the
fitted curve. Since it is a trend allowance, however, and was

¹⁰ These averages have been estimated graphically, by the technique explained
on page 485.

expected to be irregular on logical grounds (note the conditions stated on
page 279), more flexibility may be in order. Comparing Figure 56 
with Figure 53, we see that the observations have been changed only 
slightly by the further adjustments for f2''(X2) and f3''(X3). The
individual observations on both charts show a pronounced fall from 
1920 to 1924, a flattening out then for three or four years, then another 
fall to 1929. Between 1923 and 1927, Figure 56 shows that 4 out of 5 

Fig. 56. Time, and cost per ton adjusted to average operation rate and wages on
basis of second approximation curves; and second approximation to f4(X4).


observations fall above the line, whereas, between 1928 and
1935, 6 out of the 8 observations fall below the line. These departures
indicate that some changes in the first curve are justified. It is apparent
that these changes would not be inconsistent with the possible
composite effects of price-level changes and a general downward trend 
in production efficiency. The sharp fall from 1920 to 1924, however, 
largely reflects the two high observations for 1920 and 1921, offset 
somewhat by a very low observation in 1922. Accordingly, the trend 
may be interpreted as moderately downward from 1920 to 1926, more 






sharply downward to about 1929, then gradually tapering off to a low 
about 1933 or 1934, and rising gradually thereafter. A more flexible 
trend is therefore drawn in according to these general changes but not 
following single observations to the extremes of their departures.¹¹

Determination of third approximation curves. The same process 
as before is now repeated, plotting the departures from f4''(X4) around
the f2''(X2) curve, with X2 values as abscissas. This time the new
departures shown on Figure 56 are plotted back on the previous chart,
Figure 54. Crosses are used for the new departures, to distinguish them
from the previous values shown as hollow dots. To prevent confusing
the chart, the observation (year) number is not shown with the cross,
except where there are two or more observations with about the same
X2 value.

Examining the location of these new crosses on Figure 54, we 
notice that, for every observation with a value below 50 for X2, the
cross is one to one and one-half units (of X1) higher than the corresponding
dot. For values of X2 above 50, however, the crosses fall
alternately above and below the corresponding dots, with the averages
of the crosses hitting just about the curve. This pattern indicates that
the f2''(X2) curve should be raised somewhat below 50, to be still
steeper. Accordingly, a new curve is drawn in, changed as indicated, 
to pass as near as possible through the group averages of the crosses 
(as graphically estimated) and yet conform with the logical limitations 
on its shape. 

The vertical departures of the crosses from the new curve, f2'''(X2),
are then carried forward to Figure 55, as departures from f3''(X3).
Again crosses are used to represent the new values.

Inspection of Figure 55, after the crosses are inserted, discloses 
a different situation from that in the previous chart. In the left portion
of Figure 55, for values of X3 below 65, the crosses fall very close
to the corresponding dots, with no change for the average. In the
right-hand portion, for values of X3 above 75, the crosses also fall
above and below the corresponding dot. Between 65 and 75, however,
a number of the crosses fall a considerable distance below the corresponding
dot, so that out of the twelve observations in this range,
six crosses fall slightly above the f3'' line and six fall a considerable
distance below. This pattern indicates that the f3'' curve should be
made more sharply concave, without changing the elevation of either 

¹¹ Only in rare instances would a curve with this much flexibility be justified.
In this particular case its use is in line both with the theoretical analysis and the
resulting conditions imposed on the shape of the curve.



THIRD APPROXIMATION CURVES 


291 


end. A new curve is therefore drawn in to correct this, through the
group averages of the crosses. (To prevent confusion, these averages
are not shown on Figure 55.) The sharp lift in the last portion of the
curve is dependent only upon the two observations, 1920 and 1937.
However, the shape of this part of the curve is consistent with the
logical limitations and with the other observations. Except for these
two observations, a straight line would fit the crosses almost as well as
the curve. The evidence for the existence of a curve, or for its exact
shape, is thus very uncertain, as the data are distributed here.¹²

If the f''' curves are compared with the f'' curves on both Figure 54
and Figure 55, it is evident that we have determined the shape of these
curves about as well as we can with the data at hand. Even with the
material change in the trend by using the much more flexible curve
of f4''(X4), the differences between the f'' curves and the f''' curves
for X2 and X3 are insignificant. However, to complete the process
we carry the final residuals, the departures of the crosses on Figure 55
from the f3'''(X3) curve, over to Figure 56, as departures from the
line f4''(X4).

There is no improvement in the average closeness of the crosses to
the trend line, f4''(X4), as a result of the slight changes in f2 and f3.
The general characteristics of the trend, as fitted by the previous
flexible curve, remain the same. From 1923 to 1930, every cross falls
slightly above the corresponding dot, suggesting the possibility of a
slightly better fit if the trend was raised a little in this portion. The
single high value in 1932 continues to stand out, alone and unexplained.
It seems hard to justify it on any trend basis. We could eliminate
the wide departure for 1932 by twisting the lower end of f2'''(X2) up
sharply to pass through this single observation. In the absence of
confirmatory evidence from another such low year for percentage of
capacity operated, this would be a risky assumption.

Although it would be possible to modify the trend further, as suggested
in the preceding paragraph, it seems best to let it stand unchanged.
In view of the slight changes in the f2 and f3 curves in the
last approximation, we end the successive approximation process at
this point, feeling we have carried the process about to the point of
diminishing returns in increased accuracy.

It should be noted, in Figures 54, 55, and 56, that the final curves
at the end of the approximation process differ significantly from the

¹² See page 338 of Chapter 18 for the sampling reliability of the portion of a
curve determined by such extreme observations, where the theory of random
sampling may be properly applied.




first approximations only in the case of f2(X2). Almost the same
flexible trend of f4''(X4) could have been drawn in the first approximation
on Figure 53. The closeness with which f3'(X3), f4'(X4), and
f2''(X2) approximate the final curves is an indication of the great
power of the graphic method in making a rapid approach to the under- 
lying relations. The routine of comparing selected observations for 
which the values of the other independent variables are constant, or 
almost so, and judging the net relations from these selected com- 
parisons provides a much closer initial approximation to the final 
curves than does the initial assumption of linear net regressions, used 
as the starting point in the successive approximation process presented 
in Chapter 14.

(For an exercise, the student might take the example which has 
just been analyzed and determine the net regression curves by the 
method of Chapter 14, using the same limitations on the shape of the
curves as used here. That will enable him to compare the relative 
speed and effectiveness of the two methods in approaching the final 
curves.) 

As already noted the intercorrelations among X2, X3, and X4 were 
only moderate in this case. In a problem where the intercorrelations 
among the independent variables were quite high, the improvement 
in the fit of the several regression curves as a result of the successive 
approximation process might be more marked than it was in the ex- 
ample just completed. In such a case the convergence toward the 
curves of best fit will be slower than where the intercorrelations are 
low, and a larger number of successive approximations will be re- 
quired to determine the final curves. 

If, after several approximations have been made, the new curves 
start swinging up and down over curves previously determined, the
approximation has probably been carried far enough. Especially where
the intercorrelations for two independent variables are very high, a
rise in the slope of one curve will cause a fall in the slope of the other. 
In such a case the exact position of each of the two curves is inde- 
terminate, and the zone within which the last two or three approxi- 
mations vary will indicate something of the uncertainty as to the 
exact shape or location of each curve. As will be shown later (Chapter 
18), the reliability of any net regression line or curve varies inversely
with the extent to which the particular independent variable is cor- 
related with the other independent variables. Where two variables 
are so closely correlated that the relation to the dependent variable 
may be ascribed to either independent variable or parceled out be- 



STANDARD ERROR AND MULTIPLE CORRELATION INDEX 293 


tween the two, their individual effect is indeterminate. Only by secur- 
ing a large enough sample can the true influence of each be judged. 
When a large enough sample cannot be secured, that is the inherent 
fault of the data and not of the method employed. When used with due 
regard to the logical significance of the curves obtained, any one of 
the several methods will tend to give results which are substantially 
the same — that is, which lie within the range of possible accuracy 
imposed by the facts of the particular sample. 
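The indeterminacy described above can be seen in a toy case. The sketch below uses hypothetical data in which two independent variables are perfectly correlated and fits are straight lines through the origin; the function name `sum_squared_error` is likewise only illustrative.

```python
# Hypothetical data: x3 duplicates x2, so the two are perfectly correlated.
x2 = [1.0, 2.0, 3.0, 4.0]
x3 = list(x2)
y = [2.0 * v for v in x2]


def sum_squared_error(b2, b3):
    """Squared error of the additive estimate b2*x2 + b3*x3."""
    return sum((yi - b2 * u - b3 * v) ** 2 for yi, u, v in zip(y, x2, x3))


# Three very different ways of parceling the effect between x2 and x3.
splits = [(2.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
errors = [sum_squared_error(b2, b3) for b2, b3 in splits]
```

All three splits fit the data equally well, so a rise in one slope is exactly offset by a fall in the other. With intercorrelations that are high but not perfect, the same thing happens approximately, which is why successive approximations may swing back and forth.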

Determining standard error of estimate and the index of multiple
correlation. The standard error of estimate may now be determined
by first computing the value of σz. This can be done most simply
by scaling off, on Figure 56, the departures of the last adjusted values
(the crosses) from the final trend curve. These departures are the
z''' values. Any errors which have been made in any of the successive graphic
transfers will accumulate in these residuals. A more exact check can
be made by reading off the estimated values for each observation from
the final curves and adding them up to calculate the estimated X1'
and z''', according to the same method used in Chapter 14. The
z''' values as computed in this manner should agree closely with the
z''' values scaled from the final approximation chart. These calculations
are shown in Table 69B. 

Column 10 of Table 69B gives the residuals as scaled off from the 
last approximation curve on Figure 56. Column 9 gives the residuals
as computed in the usual way from the several curve readings. It is 
evident that the two columns agree very closely, the largest difference 
being only 0.4. This is an indication of the degree of accuracy main- 
tained in the successive graphic transfers. In this case graph paper 
8 by 10 inches was used in preparing the charts for Figures 51 to 56, 
and each of the transfers was double-checked. If higher accuracy 
in the mechanical process is desired, a still larger scale could be em- 
ployed. 

Taking the residuals in Column 9 as the most accurate, we may 
now calculate their standard deviation (around their own mean). It 
works out at 2.88. This compares with a standard deviation for X1
of 7.19.

Before computing S̄1.f(2,3,4) and P̄1.234, we need the values for n
and m. A simple parabola or hyperbola with two constants would
probably represent f2(X2) and f3(X3). However, f4(X4), with its
two inflections, would probably require at least three constants. In
addition, there is an a constant, represented by the mean of the z''' values.
Altogether, then, it would probably take eight constants to fit mathe- 





matical curves to the regression functions graphically determined. 
Accordingly, n = 18 and m = 8. With these values, we can now 
compute S and P by equations (65) and (66.2). 


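\[
\bar{S}^2_{1.f(2,3,4)} = \frac{n\sigma^2_{z'''}}{n - m} = \frac{18(2.88^2)}{18 - 8} = 14.9299
\]
\[
\bar{S}_{1.f(2,3,4)} = 3.86
\]
\[
\bar{P}^2_{1.234} = 1 - \frac{\bar{S}^2_{1.f(2,3,4)}}{\sigma_1^2}\left(\frac{n - 1}{n}\right) = 1 - \frac{14.9299}{(7.19)^2}\left(\frac{17}{18}\right) = .7272
\]
\[
\bar{P}_{1.234} = 0.85
\]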
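The arithmetic above is easy to verify. The following sketch recomputes the standard error of estimate and the index of multiple correlation from the four quantities quoted in the text (n = 18, m = 8, a residual standard deviation of 2.88, and a standard deviation of X1 of 7.19); the function names are hypothetical.

```python
import math


def standard_error_squared(n, m, sigma_z):
    """Adjusted squared standard error of estimate: n * sigma_z^2 / (n - m)."""
    return n * sigma_z ** 2 / (n - m)


def index_squared(n, s_squared, sigma_1):
    """Adjusted squared index of multiple correlation."""
    return 1 - (s_squared / sigma_1 ** 2) * ((n - 1) / n)


n, m = 18, 8
s_sq = standard_error_squared(n, m, sigma_z=2.88)
s_bar = math.sqrt(s_sq)
p_sq = index_squared(n, s_sq, sigma_1=7.19)
p_bar = math.sqrt(p_sq)
```

Rounded as in the text, these give 14.9299, 3.86, .7272, and 0.85 respectively.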


TABLE 69B 


Calculation of Estimated Xi from Final Regression Curves 


Year    X2     X3    f2'''(X2)  f3'''(X3)  f4''(X4)  Σ(f2+f3+f4) = X1'    X1     z'''    z'''*
 (1)    (2)    (3)      (4)        (5)        (6)           (7)           (8)     (9)    (10)

1920   88.3   77.5     67.1       ...        9.7          71.7            ...     ...     0.9
1921   47.5   60.2     67.8       ...        8.1          74.1           78.5     4.4     4.4
1922   71.3   68.6     60.6      -2.1        6.6          64.9           67.9    -7.0    -7.0
1923   88.3   67.0     57.1      -0.3        4.9          61.7           63.0     1.3     1.6
1924   69.0   70.8     61.0       1.0        3.4          65.4           63.7    -1.7    -1.8
1925   78.4   70.3     60.1       0.8        1.9           ...           62.0     1.1     0.8
1926   88.0   70.8     57.2       1.0        0.3          68.5           60.3     1.8     2.1
1927   78.9   71.3     69.0       1.2       -1.6          68.0           60.6     1.0     1.3
1928   83.4   71.8     58.1       1.4       -3.7          56.8           65.2    -0.0    -0.5
1929   89.2   72.5     57.0       1.8       -6.4          53.4           51.5    -1.9    -1.7
1930    ...   73.2     61.9       2.2       -6.3          57.8           58.0     0.8     1.0
1931   38.0   70.8     72.2       1.0       -6.9          66.3           65.6    -0.7    -0.7
1932   18.3   61.0     84.6      -1.7       -7.3          75.0           81.4     6.8     6.9
1933   28.7   59.0     77.3      -2.0       -7.5          67.8           65.0    -2.8    -2.8
1934   31.2   70.0     75.8       0.7       -7.4          69.1           64.0    -4.5    -4.1
1935   38.8   73.0     71.7       ...       -7.0          66.7           65.4    -1.3    -1.1
1936   69.3   74.0     63.5       2.6       -6.4          69.7           61.1     1.4     1.3
1937   71.2   86.0     60.6      11.0       -5.4          66.1           65.0    -0.6    -0.8

* These are the values of z''' scaled off from Figure 56.


The multiple correlation 0.85 is still close, even after the adjust- 
ment for the number of observations and constants. The standard 
error of estimate works out at $3.86 per ton. This indicates that if it 
were possible to measure this same relationship between other factors
and costs from a very large sample drawn from the same universe,
the errors in estimating steel costs for the observations in that large
sample would probably have a standard deviation of $3.86.¹³

¹³ See pages 341 to 356 of Chapter 19 for the errors of individual forecasts and
for the application of error formulas to time series. 














SHORT-CUT METHOD APPLIED TO CURVILINEAR REGRESSIONS 295 


Estimating cost for new observations. We can now use the data for 
1938 and 1939, which we have disregarded to this point, to work out 
estimates for those years from the regression curves, by the same 
process shown in Table 69B. The values are: 


Year    X2     X3    f2'''(X2)  f3'''(X3)  f4''(X4)    X1'     X1     z'''

1938   36.2   90.0     73.0       14.5       -4.3     83.2    80.5    -2.7
1939   60.7   89.7     63.1       14.2       -3.0     74.3    76.0     1.7


Just as in the similar example in Chapter 14, it is necessary to 
extrapolate two of the regression curves beyond the base data in 
making this estimate for subsequent years. In spite of the additional 
possibility of error which this introduces, both of the new estimates 
show residuals no larger than S̄1.f(2,3,4). This indicates that the
changes in steel costs during these next two years were in general 
related to the same factors as during earlier years and to about the 
same degree. (The student can check this conclusion by adding these 
two new observations to the original data, and re-analyzing the re- 
sulting sample of twenty observations.) If the trend or other factors 
were extrapolated much further, or if a sudden change in the conditions 
surrounding the industry were to occur, much larger errors of estima- 
tion might be experienced. 
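The estimates for the two new years follow from simple addition of the curve readings. The sketch below repeats that addition and compares each residual with the standard error of estimate of 3.86. The curve readings are those tabulated above; the actual cost for 1939 is read here as 76.0 (an assumption, since it is the value implied by the estimate of 74.3 and the residual of 1.7), and the dictionary layout and function names are hypothetical.

```python
S_BAR = 3.86  # standard error of estimate, dollars per ton

# Curve readings and actual costs for the two new observations.
new_observations = {
    1938: {"f2": 73.0, "f3": 14.5, "f4": -4.3, "x1": 80.5},
    1939: {"f2": 63.1, "f3": 14.2, "f4": -3.0, "x1": 76.0},  # x1 assumed, see lead-in
}


def estimated_cost(row):
    """Estimated X1: the sum of the readings from the three curves."""
    return row["f2"] + row["f3"] + row["f4"]


def residual(row):
    """Actual cost minus estimated cost."""
    return row["x1"] - estimated_cost(row)


residuals = {year: residual(row) for year, row in new_observations.items()}
within_one_standard_error = all(abs(z) <= S_BAR for z in residuals.values())
```

Both residuals (about -2.7 and 1.7) are smaller in absolute value than 3.86, in agreement with the statement in the text.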

Restating short-cut results for publication. The same methods 
described on pages 247 to 254 of Chapter 14 can be used with curves
obtained by the short-cut process, to prepare them for publication. 
There is a shorter method, however, which takes advantage of the fact 
that the curves obtained by the short-cut method are already in terms 
of a net value of X1, for one variable, plus adjustments to that value
for the other variables. All that is necessary is to determine the
average value of the final z's and use this average as the a constant.
(In the illustrative example just given, this average was only 0.07,
and consequently was ignored.) Then the final functions are determined
as follows (for the final curves of the illustrative problem):

\[ F_2(X_2) = a + f_2'''(X_2) \]
\[ F_3(X_3) = f_3'''(X_3) \]
\[ F_4(X_4) = f_4''(X_4) \]

It is evident that, except for the slight adjustment of adding a to
the first curve, these curves are the same as the final curves shown on
Figures 54, 55, and 56.





Identifying “joint” relations by the short-cut process. In some
problems the relation between the variables is such that the dependent
variable cannot be explained fully by a regression equation
which adds the regression of X1 on variable X2, to that on X3, etc.
Instead, in such cases the relation is so complex that the net change
in X1 with given changes in X2 will vary with the associated values
of X3 or other variables. This type of relationship, designated “joint
correlation,” is discussed subsequently (Chapter 21). Where such correlation
is present, it will show up in the process of examining the
subgroups of observations in the first steps of the short-cut process. 

The following empirical data will serve to illustrate the occurrence 
of joint correlation: 


Observation
  Number       X1     X2     X3     X4

     1        216      9      4      6
     2        160     10      8      2
     3        140      2      7     10
     4        264      4     11      6
     5         30      5      2      3
     6         56      7      1      8
     7          5      1      5      1
     8         16      2      2      4
     9         70      2      5      7
    10        126      7      6      3
    11        180     10      3      6
    12        280      5      7      8
    13        120      3      4     10
    14         25      1      5      5
    15        224      4      8      7
    16        120      6     10      2


The number of cases here is so small that it is difficult to eliminate 
the effects of X3 and X4, to determine the first approximation to the 
X1X2 relation. An approximate grouping can be made, however, by
classifying the observations into three groups, as follows: 

One, those with X3 and X4 both larger than their respective means. 

Two, those with X3 and X4 both smaller than their respective 
means. 

i^From Wilfred Malenbaum and John D. Black, The use of the short-cut 
graphic method of multiple correlation, Quarterly Journal of Economics, Vol. LII, 
p. 97, November, 1937. 





Three, those with X3 and X4 one above and one below their
respective means.

This gives groupings with four observations (3, 4, 12, and 15) in 
the first group, four (5, 7, 8, and 14) in the second, and eight (1, 2, 6, 9, 
10, 11, 13, and 16) in the third. Plotting each of these groups of obser- 
vations, and drawing an approximate line through each, gives the 
results shown in Figure 57. 
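The grouping just described can be checked mechanically. The sketch below is only an illustration, not part of the original procedure: it classifies each observation by comparing X3 and X4 with their means. The data are as read from the table above, taking X4 = 6 for observation 1 and X3 = 10 for observation 16; those two readings are an assumption, chosen because they are the values consistent with the groupings stated in the text. The function name and group labels are hypothetical.

```python
from statistics import mean

# (observation number, X1, X2, X3, X4) as read from the table above.
# Assumption: X4 of observation 1 is 6 and X3 of observation 16 is 10.
OBS = [
    (1, 216, 9, 4, 6), (2, 160, 10, 8, 2), (3, 140, 2, 7, 10),
    (4, 264, 4, 11, 6), (5, 30, 5, 2, 3), (6, 56, 7, 1, 8),
    (7, 5, 1, 5, 1), (8, 16, 2, 2, 4), (9, 70, 2, 5, 7),
    (10, 126, 7, 6, 3), (11, 180, 10, 3, 6), (12, 280, 5, 7, 8),
    (13, 120, 3, 4, 10), (14, 25, 1, 5, 5), (15, 224, 4, 8, 7),
    (16, 120, 6, 10, 2),
]


def classify(observations):
    """Sort observation numbers into three groups by X3 and X4 versus their means."""
    mean_x3 = mean(row[3] for row in observations)
    mean_x4 = mean(row[4] for row in observations)
    groups = {"both above": [], "both below": [], "one above, one below": []}
    for number, _x1, _x2, x3, x4 in observations:
        if x3 > mean_x3 and x4 > mean_x4:
            groups["both above"].append(number)
        elif x3 < mean_x3 and x4 < mean_x4:
            groups["both below"].append(number)
        else:
            groups["one above, one below"].append(number)
    return groups


groups = classify(OBS)
for label, members in groups.items():
    print(label, members)
```

With these readings the three groups contain four, four, and eight observations, as stated above.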



Fig. 57. Relation of X1 to X2, with observations classified on X3 and X4. When
natural numbers are used, the net regression of X1 on X2 appears to shift with the
accompanying values of X3 and X4.

This figure differs from those we have examined previously (such 
as Figure 44 on page 272 or Figure 52 on page 284) in that the relations 
as shown by the several subgroups do not parallel one another at 
relatively constant distances, but instead diverge sharply. It appears, 
therefore, that the relation of X1 to X2 depends not only on the value
of X2 but also on the associated values of X3 and X4.

In this particular case the progressive nature of the relations shown
on Figure 57 might lead us to suspect that the relation, instead of being
an additive one, is a multiplying one. If that is the case, though it
could not be represented adequately by an equation of the type:

\[ X_1 = f_2(X_2) + f_3(X_3) + f_4(X_4) \]

it still might be represented by:

\[ X_1 = [\phi_2(X_2)]\,[\phi_3(X_3)]\,[\phi_4(X_4)] \]

If that is the case, it can be determined by using the relation:

\[ \log X_1 = f_2(\log X_2) + f_3(\log X_3) + f_4(\log X_4) \]

We can test whether this is likely to give a satisfactory fit by replotting
Figure 57 on double logarithmic paper, or by plotting it on ordinary
paper, substituting the logarithms of X1 and X2 for the natural values.
Let us do the latter.


[Figure 58: logarithms of X1 plotted against logarithms of X2, with the
observations marked according to whether X3 and X4 are both above their
means, both below their means, or one above and one below.]

Fig. 58. When the logarithms of the data shown in Figure 57 are used, the net
regression of X1 on X2 is found to be about the same, regardless of the
accompanying values of X3 and X4.

When that is done, the relations appear as shown in Figure 58. The 
three lines, fitted roughly to the three sets of observations, now appear 
more nearly parallel. In particular, the line of the upper group, which 
in Figure 57 made almost a 60° angle with the line for the lower group, 
is almost perfectly parallel to it in Figure 58. Apparently in this ex- 
ample the problem can be handled satisfactorily by the usual short- 
cut procedures, merely by transforming the variables from natural 
numbers to logarithms. 
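A rough numerical counterpart to this comparison of Figures 57 and 58 is sketched below. It is an illustration under stated assumptions, not the graphic procedure itself: the freehand lines of the figures are replaced by simple least-squares lines fitted within each subgroup, and the subgroup membership is the one given earlier in the text. The function names are hypothetical.

```python
from math import log10

# (X2, X1) pairs for each subgroup, using the grouping given in the text.
GROUPS = {
    "both above": [(2, 140), (4, 264), (5, 280), (4, 224)],
    "both below": [(5, 30), (1, 5), (2, 16), (1, 25)],
    "one above, one below": [(9, 216), (10, 160), (7, 56), (2, 70),
                             (7, 126), (10, 180), (3, 120), (6, 120)],
}


def slope(points):
    """Least-squares slope of y on x (a stand-in for a freehand line)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


def spread(slopes):
    """Ratio of the steepest to the flattest slope; 1.0 means parallel lines."""
    return max(slopes.values()) / min(slopes.values())


natural = {g: slope(pts) for g, pts in GROUPS.items()}
logged = {g: slope([(log10(x), log10(y)) for x, y in pts])
          for g, pts in GROUPS.items()}

print("natural numbers:", natural, "spread", spread(natural))
print("logarithms:     ", logged, "spread", spread(logged))
```

The spread of the subgroup slopes is much smaller for the logarithms than for the natural numbers, which is the numerical equivalent of the lines becoming more nearly parallel.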

Where this transformation, or other simple transformations, do not 
serve to make the successive sub-groups show approximately parallel 
relations, the methods of Chapter 21 must be employed instead. 

Application of the short-cut method to large samples. The short-cut
method might be applied to samples too large to plot the individual
observations separately, by using a modification of the process
of subgrouping and averaging illustrated in Chapter 11. The averages
from Table 42, plotted in Figures 30 and 31, indicated quite
well the final slope of the net regression lines. That was because 
the influence of the other independent variable had been largely held 
constant by the process of subclassifying. In the same way the 
lines of averages from subgroups would tend to indicate the regres- 
sion curves in problems where curves were needed. With a sufficient 
number of observations, the first approximation to each of the net 
regression curves might be obtained from charts of subaverages similar
to Figures 30 and 31 on page 183. These several first approximation
curves could then be made the basis for working out estimated
values of X1 and residuals. The process of successive approximations
could then be continued exactly as illustrated in Chapter 14. Since the 
first approximation curves would approach fairly near to the true net 
regressions, the number of approximations required to obtain the same 
closeness of fit would usually be less than by the earlier method. 

Combination of short-cut procedures and mathematical procedures. 
Both the short-cut method of this chapter and the longer successive- 
approximation method of Chapter 14 depend on graphic methods in 
arriving at the curves of best fit. Where especially high accuracy is 
desired, the final slope of the several curves can be checked by least 
squares, according to the methods set forth in Chapter 22 on pages 401 
to 403. 

Some investigators prefer to use the short-cut method to determine 
the approximate shapes of each of the several net regression curves, 
and then to fit mathematical net regressions capable of representing 
those several shapes. The technique for fitting these mathematical 
curves to several variables is also set forth in Chapter 22 on pages 396 
to 401. If there is a logical basis to support the curves employed, 
there is some value to this procedure. If the equations are simply 
selected empirically, however, the mathematical curves have no more 
meaning than the graphic ones, for the reasons already discussed fully 
in Chapter 6. It is true that any one fitting the same set of mathe- 
matical curves to the same data by the same method will get exactly 
the same result, to the fifth decimal place in the values of the constants, 
if desired. Curves obtained by different investigators by either graphic
process, on the contrary, may vary slightly from one to another. But
the identical constants obtained by the least-squares fit have only 
a fictitious accuracy, as compared with their standard errors, or with 
the zone of uncertainty within which the function can be determined 




from the given set of observations. Multiple regression curves are 
significant only with respect to this zone, rather than to the exact line 
(as explained fully in Chapter 18). With proper care in analyzing 
the data for interrelationships and in carrying through the successive 
approximations, as explained in Chapter 14 and in this chapter, either 
graphic method will ordinarily give results about as significant, within 
their error zone, as results obtained by the more laborious methods of 
fitting mathematical curves by extensive arithmetic calculations. 

Summary. Under certain conditions first approximations to mul- 
tiple regression lines or curves may be obtained directly from the 
original observations by a graphic process based on the comparison 
of individual observations, considering several variables simultane- 
ously. This process eliminates the necessity of computing linear re- 
gressions by arithmetical means. Further, it substitutes graphic 
measurements for arithmetic calculations in correcting these curves 
to their final shape by successive approximations. It requires the re- 
searcher to examine his data more thoroughly and so to exercise 
thought and care in working out the relations and in interpreting their 
significance. Carefully used, it materially reduces the time required 
in determining multiple regression curves. 

Note 1, Chapter 16. In view of the extensive discussions which have occurred 
concerning the validity of the short-cut method, certain key articles on this point 
are listed here. 

Waite, Warren C., Some characteristics of the graphic method of
correlation, Jour. Amer. Stat. Assoc., Vol. XXVII, pp. 68-70, March, 1932.

Ezekiel, Mordecai, Further remarks on the graphic method of correlation,
Jour. Amer. Stat. Assoc., Vol. XXVII, pp. 183-185, June, 1932.

Malenbaum, W., and J. D. Black, The use of the short-cut graphic method
of multiple correlation, Quart. Jour. Econ., Vol. LII, pp. 66-112,
November, 1937.

Bean, L. H., and Mordecai Ezekiel, The use of the short-cut graphic method
of multiple correlation, Comment, and Further comment, Quart. Jour.
Econ., Vol. LV, pp. 318-346, February, 1940.

Wellman, H. R., Application and uses of the graphic method of multiple
correlation, Jour. Farm Econ., Vol. XXIII, pp. 311-316, February, 1941.

Waite, Warren C., Place of, and limitations to, the method, Jour. Farm
Econ., Vol. XXIII, pp. 317-322, February, 1941.

Working, E. J., and Geoffrey Shepherd, Notes on the place of the graphic
method of correlation analysis, Jour. Farm Econ., Vol. XXIII, pp. 322-323.

Foote, Richard J., and J. Russell Ives, The relationship of the method of
graphic correlation to least squares, U. S. Department of Agriculture,
Bureau of Agricultural Economics, mimeographed report, December,
1940.





These discussions, especially the report by Foote and Ives, and an address by
Meyer A. Girshick at the same meeting, as summarized in the February, 1941,
Journal of Farm Economics, have provided definite proof of the meaning of the
graphic method. They have shown that in linear multiple correlation the graphic
method gives results which tend to approach the lines secured by a least-squares
solution, even if the first approximations are purely arbitrary guesses. Further,
they have shown that the speed of convergence depends on the intercorrelation
among the independent variables. The higher their intercorrelation, the slower
tends to be the speed of the convergence.

The discussion and procedures in this chapter, as now revised, take into account 
these recent examinations of the meaning of the short-cut graphic method, and 
incorporate the most useful and significant suggestions to the student which have 
come out of them. 

Note 2, Chapter 16. The comments made in the note on page 258 apply to 
Chapter 16 as well. If the standard error of estimate is calculated (as shown 
on pages 293 and 294) as each new set of approximation curves is completed, it 
will show whether the gain in closeness of fit is sufficient to offset any additional 
flexibility introduced in the curves. The validity of this test, however, depends
upon the user’s skill in estimating the value of m to employ. 



CHAPTER 17 


MEASURING THE WAY A DEPENDENT VARIABLE CHANGES 
WITH CHANGES IN A NON-QUANTITATIVE INDEPENDENT 

FACTOR 

It is frequently desirable to determine the change in one variable 
associated with changes in an independent factor which varies in 
such a way that it cannot be measured quantitatively. Thus if the 
significance of various factors affecting farm values is to be deter- 
mined, one may wish to include type of road as one of the factors, 
since a farm on a concrete road should be expected to be worth 
more than one on a dirt road, other factors being the same. Yet the 
designations, concrete, brick, macadam, gravel, and dirt, cannot be 
considered in the correlation analysis in the way that the numbers 
measuring variable factors are treated. 

Where no other factors are involved, a non-quantitative factor may 
be treated by sorting with respect to that factor, and averaging the 
dependent variable. Thus if only type of road is being considered, 
the average value per acre of farms fronting on each type of road 
may be taken as the measure of the influence of roads on value. If, 
however, several other factors must be considered at the same time, 
such as value of improvements, productivity of the soil, distance 
from town, etc., and if there is any relation between differences in 
these factors and differences in road type (as in general there will 
tend to be), the influence of road type must be measured by some 
application of multiple correlation methods. Fortunately the methods 
of multiple curvilinear correlation, as presented in Chapters 14, 15, 
and 16, can be extended to treat non-quantitative factors as well, 
and thus provide the answer to the difficulty. 

Eliminating the influence of other variables. The method of de- 
termining regressions for non-quantitative variables may be illustrated 
by the data shown in Table 70. These data are from a study of the
relation of various quality factors to the price of eggs sold at retail.¹
The factors shown in Table 70 are X2, an index of the interior quality

¹ Original data collected by C. B. Howe. See reference 42 of Chapter 23.

TABLE 70

Data for Egg Problem, with a Non-Quantitative Independent Variable

   Independent variables     Dependent
                             variable,
  X2    X3    X4    X5*         X1        z'''     f(X5)     z''''

  21    23     4     C          35        -7.3     +0.6      -7.9
  35    24    12     C          45        -8.4     +0.6      -9.0
  26    23    12     B          55         3.4     +0.9       2.5
  27    24    12     B          55         3.3     +0.9       2.4
  31    22    12     A          50        -1.8     -2.5       0.7
  35    24    12     C          44        -9.4     +0.6     -10.0
  28    23    12     C          60         8.2     +0.6       7.6
  41    23    12     B          60        -4.8     +0.9      -6.7
  28    26     2     C          45        -1.6     +0.6      -2.2
  24    23    11     B          52         4.6     +0.9       3.7
  28    20    12     C          45        -5.5     +0.6      -6.1
  49    24    12     C          55        -3.6     +0.6      -4.2
  30    24    12     C          55         2.4     +0.6       1.8
  48    23    12     B          60         1.9     +0.9       1.0
  19    22     9     C          45         1.8     +0.6       1.2
  22    23     3     A          45         1.7     -2.5       4.2
  33    25    12     C          60         6.6     +0.6       6.0
  26    24    12     C          59         6.9     +0.6       6.3
  35    23    12     B          55         2.1     +0.9       1.2
  20    23    12     B          50        -0.9     +0.9      -1.8
  25    25    12     B          55         2.6     +0.9       1.7
  46    24    12     B          60         2.5     +0.9       1.6
  30    26     1     B          45        -3.2     +0.9      -4.1
  24    24    12     B          55         3.1     +0.9       2.2
  48    23    12     B          60         1.9     +0.9       1.0
  17    22    12     C          55         4.8     +0.6       4.2
  18    22    12     A          45        -5.3     -2.5      -2.8
  41    24    12     C          55        -0.3     +0.6      -0.9
  30    25    12     C          67        14.0     +0.6      13.4
  19    24     2     B          53         8.3     +0.9       7.4
  47    24     0     B          55         0.9     +0.9       0.0
  32    24    12     B          55         2.2     +0.9       1.3
  26    24    12     B          49        -3.1     +0.9      -4.0
  38    24    12     A          42       -12.2     -2.5      -9.7
  29    23    12     B          42        -9.9     +0.9     -10.8
  24    24     0     A          45        -2.9     -2.5      -0.4
  37    25    12     A          40       -14.3     -2.5     -11.8
  36    23    12     A         ...        -5.1     -2.5      -2.6

* A designates "sold without carton," B "sold in carton but unbranded," and C "sold in
carton with brand name."



TABLE 70 (Continued)

   Independent variables     Dependent
                             variable,
  X2    X3    X4    X5*         X1        z'''     f(X5)     z''''

  10    23   ...     B          47         1.2     +0.9       0.3
  35    24   ...     C          59         5.6     +0.6       5.0
  22    22   ...     B          52         1.2     +0.9       0.3
  29    21    12     B          55         4.0     +0.9       3.1
  16    23     0     B         ...        -6.5     +0.9      -7.4
   6    22     3     B         ...        -1.0     +0.9      -1.9
  31    23    12     B          55         2.8     +0.9       1.9
  26    23    12     B          55         3.4     +0.9       2.5
  36    21    12     B          60         7.8     +0.9       6.9
  39    22    12     B          55         1.4     +0.9       0.5
  42    23    12     B          60         4.8     +0.9       3.9
  36    24    12     C         ...         6.4     +0.6       5.8
  47    22    12     B         ...         2.8     +0.9       1.9
  27    24    12     C          55         2.8     +0.6       2.2
  31    22    12     A          50        -1.8     -2.5       0.7
  26    22    11     A          40        -7.2     -2.5      -4.7
  45    23    12     A         ...         3.5     -2.5       6.0
  18    25    12     C          45        -6.6     +0.6      -7.2
  35    24    12     C          50        -3.4     +0.6      -4.0
  21    23    12     C          55         4.0     +0.6       3.4
  44    23    12     A         ...         3.9     -2.5       6.4
  48    24    12     A          55        -3.6     -2.5      -1.1
  33    24    12     A          55         2.0     -2.5       4.5
  47    24    12     C          55        -3.1     +0.6      -3.7
  16    22     5     A          45         3.9     -2.5       6.4
  32    25   ...     B         ...         0.8     +0.9      -0.1
  45    25    12     B          55        -2.4     +0.9      -3.3
  46    23    12     B          57         0.0     +0.9      -0.9
  32    24    12     C          55         2.2     +0.6       1.6
  16    23     1     C          41        -4.2     +0.6      -4.8
  30    25     1     C         ...         2.3     +0.6       1.7
  24    22   ...     A          42        -5.0     -2.5      -2.5
  44    24    11     B          50        -2.6     +0.9      -3.5
  25    22    12     B          49        -2.1     +0.9      -3.0
  16    23   ...     A          45        -1.5     -2.5       1.0
  31    24     8     A          48         3.2     -2.5       5.7

* A designates "sold without carton," B "sold in carton but unbranded," and C "sold in
carton with brand name."



of the eggs in each dozen; X3, the weight of each dozen in ounces;
X4, the number of white eggs in each dozen; X5, the type of carton the
eggs were sold in; and X1, the price of eggs per dozen, in cents. Net
curvilinear regressions have been determined for the three quantitative
factors by the successive approximation method, and estimated prices
have been worked out by the regression equation

\[ X_1' = a' + f_2(X_2) + f_3(X_3) + f_4(X_4) \]

The residuals, z''', obtained by subtracting these estimated prices from
the observed prices, X1, are shown in the table. The values in the last
two columns are explained later.

Determining the net influence of the new variable. The first step
in determining the net regression of X1 on X5 is to group the residuals
from the previous curves, z''', according to the new factor X5,
and determine the average for each group. This gives results as
follows:

Value of X5                        Average of z'''

A, no carton                            -2.5
B, carton                               +0.9
C, carton and brand name                +0.6
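The grouping and averaging of residuals described here can be sketched as follows. This is an illustration only: the helper name group_means is hypothetical, and the six (z''', X5) pairs are a handful of rows picked from Table 70 to show the mechanics, so their averages are not the full-sample averages. The second part simply takes the three full-sample averages given above and expresses each carton class relative to eggs sold without a carton.

```python
from collections import defaultdict


def group_means(values, labels):
    """Average the values falling in each class of a non-quantitative factor."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, label in zip(values, labels):
        sums[label] += value
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


# A few (z''', X5) pairs taken from Table 70, for illustration only.
sample = [(-7.3, "C"), (-8.4, "C"), (3.4, "B"), (3.3, "B"), (-1.8, "A"), (1.7, "A")]
sample_means = group_means([z for z, _ in sample], [x5 for _, x5 in sample])
print(sample_means)

# Full-sample averages of z''' as given in the text.
averages = {"A": -2.5, "B": 0.9, "C": 0.6}

# Differential of each class over eggs sold without a carton (class A), in cents.
differential = {label: round(avg - averages["A"], 1) for label, avg in averages.items()}
print(differential)
```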


These results show that, after making allowances for the size, 
color, and quality of the eggs, those with unmarked cartons sold 3.4 
cents above those sold in bulk, on the average, but those with branded 
cartons sold only 3.1 cents above eggs in bulk. These results can- 
not be accepted as the final effect of package on price without first 
raising the question whether the curves previously determined to 
show the influence of the other factors might be changed somewhat 
were the type of package taken into account. Whether this will be 
true or not depends upon whether there is any correlation between 
the new factor and the factors previously considered, or whether they 
are quite independent of each other. This can be determined by 
sorting the other factors according to the values of X5, and determin- 
ing their averages for each group. The results are: 



                           Averages of other independent variables
                                                                     Number of
Value of X5                    X2          X3          X4              cases

A, no carton                  30.6        23.1         8.6               17
B, carton                     31.6        23.2         9.6               33
C, carton and brand           29.9        23.8        10.2               24





There does seem to be some correlation between X5 and the
other variables. Apparently the eggs sold in unmarked cartons are,
on the average, of the best quality and of medium size; the eggs sold
in cartons under brand names are of larger size, but are not of such
high quality, on the average; whereas those sold in bulk average
medium in quality but low in size.² Accordingly, the curves previously
determined for the change in price with differences in size and
in quality may have included some portion of the effect really
associated with cartons instead. Now that at least an approximate
measure has been obtained of the influence of carton on price, the
previous curves may be modified by taking this factor also into
account.

² The exact correlation between X5 and X2, X3, and X4 can be computed by
estimating each of the other variables from the values of X5, using the averages
of X2, X3, and X4 for each group of X5 as the estimated values of X2, X3, and X4,
for the cases falling in each group. The residuals between the estimated and actual
values, and their standard deviation, can then be computed for each of the three
variables. Then the indexes of correlation can be computed in the usual way.
When computed this way by using group averages instead of a continuous function,
the special name correlation ratio is given to the correlation, and the symbol η is
used to designate it. This value may be more rapidly computed by the following
formula (using Y to represent the dependent variable, and X the independent
variable, just as with simple correlation in Chapters 5 to 7):



Here ηyx is the correlation ratio for Y values estimated from group averages when
sorted on X; n0(M0)² is the number of cases in each group times the square of the
average value of Y for that group, Σ[n0(M0)²] is the sum of all such values, σy is
the standard deviation of the variable being estimated, and n is the number of all
the observations (= Σn0).

The process may be illustrated by calculating η25, the correlation ratio between
X2 and X5, from the data above:



The value as calculated is subject to the same correction, equation (26), as the
correlation index, with m = number of groups.

So adjusted, η25 shrinks to 0, showing no real correlation.

This same measure of correlation by group averages can be applied to
quantitative variables as well as to non-quantitative ones, but in that case it has less
significance than the index of correlation, which relates to a continuous function
instead of an irregular line of averages.
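The correlation ratio mentioned in the footnote can be computed from group averages along the following lines. This is a sketch of the standard definition of the correlation ratio, which uses only the quantities the footnote names plus the over-all mean of Y; it is assumed, not confirmed, to agree with the footnote's own formula. The function name and the small data set are hypothetical, and the correction of equation (26) is not applied.

```python
from math import sqrt
from statistics import mean, pstdev


def correlation_ratio(y, groups):
    """Unadjusted correlation ratio of y on a grouping variable.

    eta^2 = (sum over groups of n0 * M0^2  -  n * My^2) / (n * sigma_y^2),
    where n0 and M0 are the size and mean of each group, My is the mean of
    all the y values and sigma_y their standard deviation.
    """
    n = len(y)
    by_group = {}
    for value, label in zip(y, groups):
        by_group.setdefault(label, []).append(value)
    between = sum(len(v) * mean(v) ** 2 for v in by_group.values()) - n * mean(y) ** 2
    total = n * pstdev(y) ** 2  # zero if y does not vary; not handled here
    return sqrt(max(between, 0.0) / total)


# Hypothetical data, for illustration only.
y_values = [1, 2, 3, 4, 5, 6]
classes = ["a", "a", "b", "b", "c", "c"]
eta = correlation_ratio(y_values, classes)
print(round(eta, 3))
```

A ratio near 1 means the group averages account for nearly all the variation in Y; a ratio near 0 means the groups differ little in their averages.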

Taking account of the non-quantitative variable in estimating X1
and z. The first steps in the procedure of allowing for the extent to
which prices varied with the carton are shown in Table 70. In the
column headed f(X5) the approximate influence of differences in carton
on price is entered, the averages found in the tabulation on page 305
being used. Since these values would be added to the previous
estimated values of X1 to obtain the new estimates, they may instead
be subtracted from the previous residuals (z''') to obtain the revised
residuals. The last column shows these new values for z''''. Before
using these new values to see if any changes are necessary in the other
regression curves we may first determine how much the standard error
of estimate has been reduced by taking X5 into account. This could
be determined directly by computing the standard deviation of the
new z'''' values; but a much shorter method is available, using the same
principle employed in footnote 2. By the use of this method, σz''''
may be computed from σz''' by the formula





The necessary computations are: 



Value of X5      M0 (average of z''')     Number of cases, n       nM0        n(M0)²

     A                  -2.5                     17               -42.5       106.25
     B                   0.9                     33                29.7        26.73
     C                   0.6                     24                14.4         8.64

   Sums                                                             1.6       141.62




Computing the standard error for estimates based on X5 and the other
variables, we must recognize that the value of m has been increased by
three by the introduction of the new factor; so, whereas m was assumed
to equal 8 previously, it now equals 11. Adjusting the values of 5.06
for σz''' and 4.87 for σz'''' by equation (65), we find S̄1.f(2,3,4) = 5.36,
and S̄1.f(2,3,4,5) = 5.27. Apparently the introduction of X5 as a
factor has had as yet but slight effect on the accuracy with which egg
prices might be estimated.
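The arithmetic of this step can be retraced as in the sketch below, which rests on two assumptions not confirmed by the text as it stands here. First, the shorter method is taken to be the ordinary variance decomposition: the variance of the new residuals equals the variance of the old residuals less the variance of the group averages. Second, equation (65) is taken to have the form S̄² = σ²·n/(n − m). Both assumptions reproduce the figures quoted above to within rounding.

```python
from math import sqrt

# Group sizes and group averages of z''' from the computation table.
n_by_group = {"A": 17, "B": 33, "C": 24}
mean_by_group = {"A": -2.5, "B": 0.9, "C": 0.6}

n = sum(n_by_group.values())
sum_nM = sum(n_by_group[g] * mean_by_group[g] for g in n_by_group)
sum_nM2 = sum(n_by_group[g] * mean_by_group[g] ** 2 for g in n_by_group)

sigma_old = 5.06  # standard deviation of z''' given in the text

# Assumed method: subtract the variance of the group averages.
variance_of_group_means = sum_nM2 / n - (sum_nM / n) ** 2
sigma_new = sqrt(sigma_old ** 2 - variance_of_group_means)


def adjusted_standard_error(sigma, n, m):
    """Assumed form of equation (65): allow for the m constants fitted."""
    return sqrt(sigma ** 2 * n / (n - m))


s_before = adjusted_standard_error(sigma_old, n, 8)
s_after = adjusted_standard_error(sigma_new, n, 11)
print(n, round(sum_nM, 1), round(sum_nM2, 2))
print(round(sigma_new, 2), round(s_before, 2), round(s_after, 2))
```

The sums agree with the table, and the resulting standard deviation and adjusted standard errors fall within about 0.01 of the 4.87, 5.36, and 5.27 given in the text.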

Making further successive approximation corrections. It is still
possible, however, that the regressions for the other factors might be
modified now that X5 has been at least approximately allowed for.
Consequently the values of z'''' are classified according to the values of
X2, X3, and X4, and the averages computed for each group. The
averages given in Tables 71, 72, and 73 are secured. The averages in
Table 71 suggest that the curve for f2(X2) might be modified slightly, so
as to rise more steeply in the portion up to X2 = 40 and less steeply
thereafter. Table 72 does not indicate any consistent relation between
X3 and z'''', so no further change in f3(X3) is indicated. Table 73
indicates that the curve for f4(X4) might also be altered slightly, so as
to have a somewhat steeper slope.


TABLE 71

Average Values of z'''' for Corresponding X2 Values

X2 values      Number of cases      Average of X2      Average of z''''

  0-14                2                  8.0                -0.9
 15-19                9                 17.2                -0.2
 20-29               23                 25.1                +0.1
 30-39               24                 33.5                +0.3
 40-49               16                 45.5                -0.1


If f2(X2) and f4(X4) were modified as suggested, a new estimated
value of X1 might then be worked out, using these new curves and the
previous curve for f3(X3), and using the values for f5(X5) already
entered in Table 70. The new z's based on these new estimates might
then be classified with respect to X5, to determine if any change need
be made in the values for f5(X5) worked out on page 305. If any
material change were found necessary in f5(X5), the residuals might be
corrected accordingly, and then averaged with respect to X2, X3, and
X4, to see if any further changes would be needed in their values.
This process of successive approximation should be continued until no
further significant change was indicated in any of the curves, or until
the S̄1.f(2,3,4,5) showed no further reduction.

TABLE 72

Average Values of z'''' for Corresponding X3 Values

X3 values      Number of cases      Average of z''''

   20                 1                  -6.1
   21                 2                   5.0
   22                13                   0.1
   23                23                   0.2
   24                25                  -0.1
   25                 8                   0
   26                 2                  -3.2


In view of the fact that none of the averages of z'''' shown in Tables
71 to 73 is so large that it could not very readily have occurred
by chance, it does not seem worth while, in this problem, to carry out
the additional steps just outlined. In a problem where the
non-quantitative factor is an important one, however, and where it is
significantly correlated with the other independent variables, the
determination of the net function for that factor should be carried through
a sufficient number of approximations to measure the final net effect of
each factor as accurately as possible.

TABLE 73

Average Values of z'''' for Corresponding X4 Values

X4 values      Number of cases      Average of X4      Average of z''''

    0                 7                  0                  -1.3
   1-2                3                  1.4                -0.4
   3-5                4                  3.8                +0.2
   8-11               5                 10.0                +0.5
   12                53                 12.0                +0.2


Taking the preliminary results shown on page 305 as the final 
measure of the influence of type of container on price, we may then 
conclude that eggs sold in an unmarked carton brought, on the average, 








3.4 cents more per dozen than eggs of the same quality, size, and color 
sold in bulk, and 0.3 cent more than eggs sold in a carton with a brand 
name. (This last result might reflect the experience of consumers with 
branded eggs of poor quality, as indicated in the tabulation on page 305, 
which might tend to make them sell at a discount even when they were 
of equal quality.) The significance of the relation may be measured 
by the slight reduction in the standard error of estimate previously 
noted, or else by the increase in the index of multiple correlation. 
Computing the indexes of multiple correlation corresponding to the
standard errors of estimate before and after the type of carton is
allowed for, by equation (66.2), we find them to be P1.234 = 0.59;
P1.2345 = 0.62. The corresponding indexes of determination, 35 and
38 per cent, indicate that taking into consideration the differences in
the carton has increased the proportion of the variation in egg prices
which can be explained by 3 per cent of the original variance, even after
due allowance is made for the additional constants the process
introduces into the estimating equation.

It should be noted that the first approximation to the regression 
on non-quantitative factors can be made directly from the first set of 
residuals, computed from the linear multiple regression equation, 
instead of waiting until after approximate regression curves are de- 
termined for the other factors. In case a non-quantitative factor is 
a very important one, so that ignoring it in determining the net linear 
regressions may seriously impair their accuracy, it may be roughly 
included by designating successive groups by a numerical code which 
approximates the expected influence of the variable. Then if the 
true influence is of a different order from the expected influence, 
that fact will show up when the first approximation curves are worked 
out. (For the non-quantitative factor the averages of residuals must
be interpreted as discrete points for each class, however, rather than 
as a continuous function.) Thus for the egg problem it might have 
been tentatively assumed that eggs in branded cartons would sell 
above eggs in unbranded cartons, and both would sell well above 
eggs in bulk. The bulk eggs could then have been designated by 1; 
the unbranded cartons by 3; and branded cartons by 4. The net 
linear regression would have been positive; but the analysis of the 
residuals would have revealed that the eggs in branded cartons really 
averaged lower in price (other factors equal) than the eggs in un- 
branded cartons, so the final conclusion would probably be much the 
same as the one just determined. 





Summary. Where an independent factor is not a continuous 
variable, but may be classified into two or more groups, the regression 
of a dependent factor may be determined with respect to each group, 
while holding other factors constant by the usual multiple correlation 
process. Standard errors and indexes of correlation may be worked 
out to include the effects of non-quantitative independent factors 
equally as well as for continuously variable factors. 



CHAPTER 18 


DETERMINING THE RELIABILITY OF CORRELATION 
CONCLUSIONS 

Early in this book it was pointed out that when any statistical 
measure, such as an average, is determined from a sample selected 
from a universe under study, the true value of that measure in the 
universe might be different from the value shown by the sample. 
Methods were discussed which enable one to estimate how far the 
average from such a sample may vary from the true average, for a 
stated proportion of such samples. Such estimates enable one to judge 
how much confidence may be placed in an average calculated from a 
given sample. 

Simple Correlation 

Regression coefficients. Correlation constants determined from
finite samples are just as subject to variation as are other statistical
constants. Thus in an experiment 5 samples of 30 observations each
were drawn at random from the same universe. The true value of
byx for the universe was 0.152. The regression of Y on X was
determined separately for each sample. The values for byx which were
secured from the 5 samples varied from -0.136 to +0.449, as shown
in Table 74. When 5 samples of 50 observations each were drawn, and
the regressions computed for each, the range was reduced to -0.297
to +0.175; but the variation between samples was still large. Even
when 100 observations were included in each sample, the regressions
were by no means identical, though the range was reduced still more.

TABLE 74

Values of byx Secured in Successive Samples Drawn from the Same Universe,
with Different Numbers of Observations

                 30 observations      50 observations      100 observations

                      0.292                0.175                 0.113
                      0.012               -0.297                 0.120
                     -0.136                0.144                 0.303
                     -0.022                0.130                 0.197
                      0.449                0.167                 0.132

True value            0.152                0.152                 0.152


It is evident that the observed values of $b_{yx}$ fell both above and
below the true value for the universe from which the samples were
being selected.¹ It is also evident that the smaller the number of
observations, the larger the variation in the results between different
samples and the greater the possibility of a serious difference between
the true value and that indicated by the sample. The amount of
variation likely to be present in regressions determined from random
samples of any specified size may be estimated by the equation

$$\text{Standard error of } b_{yx} = \sigma_b = \frac{\bar{S}_{y.x}}{\sigma_x \sqrt{n}} \qquad (69)$$

Since this constant is computed from the adjusted value, $\bar{S}_{y.x}$, no further
adjustment is required.

If only one of the samples in Table 74 had been obtained, say the
first one with 50 observations, the observed value for $b_{yx}$ would have
been +0.175. The standard error of estimate for this sample was
2.46, and the $\sigma_x$ was 2.44. Computing the standard error of $b_{yx}$ for
this sample by means of equation (69),

$$\sigma_b = \frac{2.46}{2.44\sqrt{50}} = \frac{2.46}{17.25} = 0.143$$

the value of $b_{yx}$, as determined from this single sample, may therefore
be stated to be 0.175 ± 0.143.
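The arithmetic of equation (69) for this sample can be sketched in a few lines. The function name is a hypothetical one chosen for illustration; the inputs are the figures quoted in the text.

```python
import math

def se_regression_coefficient(s_yx, sigma_x, n):
    """Equation (69): adjusted standard error of estimate / (sigma_x * sqrt(n))."""
    return s_yx / (sigma_x * math.sqrt(n))

b = 0.175
se_b = se_regression_coefficient(2.46, 2.44, 50)
print(round(se_b, 3))                           # 0.143
print(round(b - se_b, 3), round(b + se_b, 3))   # 0.032 0.318
```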

The standard error of the regression coefficient is interpreted exactly
the same as the standard error of the average was interpreted in
Chapter 2. In two samples out of three, on the average, the observed
regression will miss the true regression by not more than one standard
error calculated from the sample. Therefore, if in this case we say
that the true regression lies between 0.175 - 0.143 and 0.175 + 0.143,
or between 0.032 and 0.318, we are making a statement of a type which,
if made for a succession of such samples, will be wrong one time out of
three, on the average. Similarly, if we said that the true regression
probably lies between -0.111 and 0.461, i.e., within a range of twice
the standard error from the observed value, we are making a statement
of a kind which, if made for a series of samples, will be wrong in one
sample out of twenty, on the average.

¹ In some textbooks, $b_{yx}$ would be used to represent the regression as determined
from the sample and $\beta_{yx}$ would be used to represent the true value of the
corresponding regression in the universe from which the sample was drawn. In this
notation, in Table 74, the value for $\beta_{yx}$ = 0.152. In consulting textbooks using
this notation, we should not confuse this use of the $\beta$ with the special definition
given it in Chapter 13, equation (52).

It happens, in this particular case, that four out of five of the
observed regressions (for samples of 50) fall within one $\sigma_b$ of the
regression from the first sample.² It also happens that the true value
falls within that range. This will not always be true, however.
For example, if the sample had happened to give the same results as
the third sample of 30 observations, with $b_{yx} = -0.136$, the case might
have been different. For that sample, the values of the other
constants were such as to make $\sigma_b = 0.109$. The value of $b_{yx}$, as indicated
by this sample, therefore, -0.136 ± 0.109, is such that the observed
value lies 2.6 times its own standard error from the true value, 0.152.
Although a departure as large as this would ordinarily be expected
to occur only once out of every 100 samples on the average (0.009),
still it may happen with any particular sample.³ For that reason, if
very great accuracy is desired, a range of three times the standard 
error may be used as the criterion. There is but one chance out of 
nearly 400 (0.0027) that a given random sample will yield a constant 
such as a regression coefficient which will fall more than three times its 
own standard error away from the true value for the universe. 
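The odds quoted for departures of one, two, 2.6, and three standard errors follow from the normal curve. As a minimal sketch (the function name is hypothetical), the two-sided tail area can be computed with the complementary error function from the Python standard library:

```python
import math

def two_sided_tail(k):
    """Chance that a normally distributed estimate misses the true
    value by more than k standard errors, in either direction."""
    return math.erfc(k / math.sqrt(2.0))

# Departure of the third sample of 30 from the true value, in standard errors.
departure = abs(-0.136 - 0.152) / 0.109
print(round(departure, 2))

for k in (1.0, 2.0, 2.6, 3.0):
    print(k, round(two_sided_tail(k), 4))
```

The four probabilities come out near one in three, one in twenty, 0.009, and 0.0027, in line with the statements in the text. They hold only for large samples.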

These probabilities apply only in case there are thirty or more 
degrees of freedom (n-m) in the sample. As was pointed out in 
Chapter 2, if the number of degrees of freedom is less than thirty, 
the probabilities of falling outside of any given range of the true value 
are increased, as shown in Table A on page 23. In using this table 
for regression coefficients, subtract 1 from the number of cases in the 
sample before looking the probability up in the table.⁴

Thus if a value of $b_{yx} = 0.50 \pm 0.12$ were found from a random
sample of 11 cases, the reliability of the observed regression could be
judged from the column headed 10 in Table A. That column indicates
that, with samples of this size, 34 out of each 100 samples, on the
average, would miss the regression in the universe by as much as 0.12
($1\sigma_b$); about 8 out of each 100 would miss by as much as 0.24 ($2\sigma_b$);
and 15 samples out of each 1,000 would miss by as much as $3\sigma_b$, or 0.36.
Thus in this case, if we say that the true value probably lies between
0.14 and 0.86, we are making a statement of the sort which is likely to
be wrong only once or twice out of each hundred such statements, if
the sample was drawn under such conditions that the formulas of
simple sampling hold true.

² A more precise way of stating this comparison would be to show a series of
regressions from samples drawn from the same universe, such as those listed in
Table 74, with each sample regression followed by ± its own standard error. If
that were done, it would then be found that, in two samples out of three, on the
average, the value $b_{yx} \pm \sigma_b$ would overlap the true value of $b_{yx}$ for the universe.

³ Probability tables, such as that given in Table A of Chapter 2, or shown
graphically in Figure A, page 505, list these odds for various multiples of the $\sigma$.

⁴ That is because two constants (a and b) have been determined simultaneously
in the process of getting b, whereas the table is stated for arithmetic means, which
represent the determination of only a single constant. (See page 22, footnote 7.)
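The small-sample odds quoted above for a sample of 11 cases can be approximated from Student's t distribution. This is a sketch only: it assumes that the column headed 10 in Table A corresponds to 9 degrees of freedom (11 cases less the two fitted constants a and b), and it integrates the t density numerically by Simpson's rule rather than reading a table. The function names are hypothetical.

```python
import math

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2.0) / (math.sqrt(df * math.pi) * math.gamma(df / 2.0))
    return c * (1.0 + x * x / df) ** (-(df + 1) / 2.0)

def two_sided_t_tail(t, df, steps=2000):
    """P(|T| > t), using Simpson's rule on the interval from 0 to t."""
    t = abs(t)
    if t == 0.0:
        return 1.0
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    inside = total * h / 3.0           # area between 0 and t
    return 1.0 - 2.0 * inside

n_cases = 11
df = n_cases - 2                       # assumption: two constants were fitted
for k in (1, 2, 3):
    print(k, round(two_sided_t_tail(k, df), 3))
```

Under these assumptions the three probabilities come out close to 34 in 100, 8 in 100, and 15 in 1,000.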

It should be noted from equation (69) that the standard error of 
the regression coefficient varies inversely with the square root of the 
number of observations. The effect of this is illustrated in Table 74. 
The variation of the regression coefficients obtained from samples of 
100 observations is only about half as great as the variation of the 
regression coefficients from samples of 30. 
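The "about half" figure can be checked directly from the inverse-square-root relation in equation (69). As a worked sketch, assuming the standard error of estimate and $\sigma_x$ are the same for both sample sizes:

$$\frac{\sigma_b \ (n = 100)}{\sigma_b \ (n = 30)} = \frac{\sqrt{30}}{\sqrt{100}} = \sqrt{0.30} \approx 0.55$$

On the same reasoning, the number of observations must be quadrupled to cut the standard error exactly in half.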

Regression line. Not only may the observed slope of the regression 
line vary from the true slope, but the elevation of the line, as observed 
from a sample, may vary from the true elevation. Formula (69) has 
already indicated a way of determining the standard error of the 
regression coefficient, and so of estimating the probable range within 
which the true slope lies. The height of the regression line is most
accurately determined for the mean estimated value, $M_{y'}$, of the
dependent factor, corresponding to the observed mean value of X, the
independent factor. If we define the mean as

$$M_{y'} = a_{yx} + b_{yx} M_x$$

we may find its standard error by the formula

$$\sigma_{M_{y'}} = \frac{\bar{S}_{y.x}}{\sqrt{n}} \qquad (70)$$


The standard error of the whole regression line may now be
determined from equations (69) and (70). We may illustrate by data from
the cotton-yield problem used as an example in Chapter 8, on page 147.
With 14 observations, the values were $b_{yx} = 16.70$, $a_{yx} = -2.261$,
$M_x = 1.97$, $\bar{S}_{y.x} = 8.28$, $\sigma_x = 0.73$, $M_y = M_{y'} = 30.64$, $\sigma_y = 14.43$.

$$M_{y'} = -2.261 + (16.70)(1.97) = 30.64$$

$$\sigma_{M_{y'}} = \frac{8.28}{\sqrt{14}} = 2.21$$

$$\sigma_b = \frac{8.28}{0.73\sqrt{14}} = 3.03$$





Since the estimated value, $Y'$, equals $M_{y'} + b(x)$, the standard error
of the estimate for any value of $x$ will be composed of the sum of
the standard errors of $M_{y'}$ and of $b(x)$. Standard errors are standard
deviations; hence they can be summed only by adding their squares
(as demonstrated in Appendix 2, Note 1). The standard error of $Y'$,
for any particular value of X, is therefore given by the equation ⁵

$$\sigma_{y'} = \sqrt{\sigma_{M_{y'}}^2 + (\sigma_b x)^2} \qquad (70.1)$$

where $x$ is the departure of X from its mean. By using this relation,
the calculation of the standard error of $Y'$, for selected values of X,
is shown in the following tabulation:


Calculation of $\sigma_{y'}$, with $\sigma_b x = 3.03x$ and $\sigma_{y'}^2 = (\sigma_b x)^2 + \sigma_{M_{y'}}^2$

Selected     Departures
values       from mean
of X         of X (x)     $\sigma_b x$   $(\sigma_b x)^2$   $\sigma_{M_{y'}}^2$   $\sigma_{y'}^2$   $\sigma_{y'}$

 0.97         -1.00        -3.030          9.1809             4.8841           14.0650        3.75
 1.47         -0.50        -1.515          2.2952             4.8841            7.1793        2.68
 1.97          0            0              0                  4.8841            4.8841        2.21
 2.47          0.50         1.515          2.2952             4.8841            7.1793        2.68
 2.97          1.00         3.030          9.1809             4.8841           14.0650        3.75
 3.47          1.50         4.545         20.6570             4.8841           25.5411        5.05
 3.97          2.00         6.060         36.7236             4.8841           41.6077        6.45
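The tabulation can be reproduced step by step from equations (69), (70), and (70.1). In the sketch below the two standard errors are rounded to two decimals before use, as the text does, so that the tabulated figures are matched; the variable and function names are hypothetical.

```python
import math

n = 14
s_yx = 8.28        # adjusted standard error of estimate
sigma_x = 0.73
mean_x = 1.97

# Rounded to two decimals, as in the text, to reproduce the tabulated values.
se_mean = round(s_yx / math.sqrt(n), 2)              # equation (70)
se_b = round(s_yx / (sigma_x * math.sqrt(n)), 2)     # equation (69)

def se_estimate(x_value):
    """Equation (70.1): standard error of Y' at a selected value of X."""
    x = x_value - mean_x               # departure from the mean of X
    return math.sqrt(se_mean ** 2 + (se_b * x) ** 2)

for x_value in (0.97, 1.47, 1.97, 2.47, 2.97, 3.47, 3.97):
    print(x_value, round(se_estimate(x_value), 2))
```

The standard error is smallest at the mean of X and widens in both directions, which is why the zone of uncertainty around a regression line flares toward the ends of the data.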


There are 14 cases; subtracting the one extra constant involved in 
correlation determinations gives 13 as the number of observations with 
which to judge from Table A the significance of these standard errors. 
Taking values midway between those for 10 and for 16 cases, we find 
that the statement that the true values of $b_{yx}$ and of $M_{y'}$ do not differ
from the observed values by more than the calculated standard errors
will be wrong for 34 out of each 100 such statements, on the average.
Similarly, the statement that they do not differ by more than twice the
calculated standard errors will be wrong for 7 out of 100 such
statements, on the average. The chances are therefore 93 out of 100 that
the true regression line would fall within twice the standard errors just
calculated. Plotting $2\sigma_{y'}$ above and below the corresponding values of
$Y'$, given by the regression line, shows this range. These limits are
plotted in Figure 59, together with the original observations and the
regression line. The limits within which the line probably fell could be
shown in a similar manner for any other desired limit of probability.
It is now clear why great caution must be exercised in extending even a
linear regression line beyond the range of the data from which it
is derived. As is evident in the figure, the true position of the line
becomes very uncertain as the limits of the data are approached,
and the uncertainty increases rapidly beyond them.

⁵ Holbrook Working and Harold Hotelling, Applications of the theory of error
to the interpretation of trends, Journal of the American Statistical Association
Papers and Proceedings, xxiv, pp. 73-85, March supplement, 1929.



Fig. 59. Linear regression of cotton yield on irrigation water applied, and range 
within which the true relation probably lies. 

In many correlation problems, the regression line is the most im- 
portant result of the study. The confidence that can be placed in the 
line determined from a random sample is no greater than is indicated 
by the probable error of its slope, or the standard error zone of its 
position. Accordingly, the final statement of the regression coefficient
or regression line should always indicate clearly the standard error
or probable error zone, and should also state the number of
observations on which the conclusions are based. This will serve to caution
the reader of the extent to which the values may vary from the true
value simply due to chance fluctuations of sampling, and so caution
him not to attach more importance to them than their significance
justifies.

Correlation coefficients. In exactly the same way that regression
coefficients will vary from sample to sample, all other statistical
constants tend to vary. Regression coefficients from random samples
tend to be normally distributed around the true value, so that the
probability of a given departure from the true value occurring may
be judged from the normal curve;⁶ but that is not equally true of
correlation coefficients. If the number of observations in the sample
is exceedingly large, so that fairly stable results are secured, the
distribution of the observed correlations will tend to be nearly normal,
so that the standard error may be estimated by the formula ⁷

$$\text{Standard error of } r = \sigma_r = \frac{1 - r^2}{\sqrt{n - 2}} \qquad (71)$$


This equation applies only when n is large, say 100 or more. To 
test the significance of correlation coefficients obtained from small 
samples, Fisher has developed the equation 


$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} \qquad (71.1)$$


The value t is used to judge the probability of the occurrence of
such a correlation purely by chance, in exactly the same way that
the ratio of an average to its standard error is used to judge the
significance of the average. Thus
if a correlation of 0.60 is secured with a sample of 21 cases, t = 3.26.
Looking up this value in Table A on page 23, or Figure A of
Appendix 3, using 20 for n,⁸ we find that only in one sample out of 200
random samples, on the average, would a value this large or larger
be obtained from a universe with no correlation present. If, however,
a correlation of 0.60 had been secured with only 7 cases, t would equal
1.68. Figure A indicates that, with this value of t, the chances of
getting a correlation this large or larger from random samples drawn
from a universe with no true correlation would be almost 0.16. This
means that out of 100 such samples obtained from a universe in which
the true correlation was zero, 16, on the average, would show a
correlation as high as 0.60.⁹

⁶ The normal curve is the basis for the probability data given in the last column
of Table A of Chapter 2.

⁷ Equation (71) holds precisely true only when the value used for r is the true
correlation in the universe, rather than the value observed in the sample. This
limitation does not apply to equation (71.1).

⁸ Just as with regression coefficients, 1 less than the number of cases should be
taken for n when Table A is used to judge the significance of a correlation
coefficient. The unadjusted correlation, r, should be used in all tests of significance, not
the adjusted value $\bar{r}$.
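The two values of t in this illustration can be recomputed from equation (71.1), taken here in the form t = r√(n − 2)/√(1 − r²). The function name is hypothetical. Carried to more places, the first value is 3.269, so it prints as 3.27 when rounded where the text gives 3.26.

```python
import math

def t_for_correlation(r, n):
    """Equation (71.1): t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

print(round(t_for_correlation(0.60, 21), 2))   # 3.27
print(round(t_for_correlation(0.60, 7), 2))    # 1.68
```

The same observed correlation of 0.60 thus gives a t about twice as large with 21 cases as with 7, which is what makes the larger sample so much more convincing.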

Although this method may be used in conjunction with Table A 
to determine whether or not the correlations computed from small 
samples are any valid indication of a correlation in excess of zero, it 
cannot be used to determine the significance of the difference in cor- 
relation between two samples or to determine whether or not the 
correlation in a given sample exceeds any specific value. In the first 
illustration, for example, where r = +0.60, one might wish to know
the probability that the true correlation in the universe exceeds +0.20.
Owing to the skewed distribution of values of r when computed from 
small samples, this cannot be determined by a simple sampling formula. 
R. A. Fisher has devised a method, however, of so transforming 
observed values of r as to give them a normal distribution, and then 
solving such problems as this from the transformed values. For 
methods of dealing with this phase of sampling, the reader is referred 
to his presentation of the method in Statistical Methods for Research 
Workers, seventh edition, pages 202 to 211. 
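For readers without Fisher's book at hand, the question posed above can be sketched with the usual textbook form of his transformation, z = atanh(r), whose sampling distribution is taken as nearly normal with standard error 1/√(n − 3). This form is an assumption brought in for illustration; it is not reproduced from the pages cited, and the function name is hypothetical.

```python
import math

def prob_r_this_high(r_observed, rho_assumed, n):
    """Approximate one-sided chance of a sample correlation as high as
    r_observed if the true correlation were rho_assumed (Fisher z)."""
    z = (math.atanh(r_observed) - math.atanh(rho_assumed)) * math.sqrt(n - 3)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# First illustration: r = +0.60 from 21 cases, tested against +0.20.
p = prob_r_this_high(0.60, 0.20, 21)
print(round(p, 3))
```

Under these assumptions the chance is small, roughly two in a hundred, so a true correlation as low as +0.20 would be an unlikely source of the observed +0.60.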

Certain of Fisher’s methods to determine the reliability of observed 
correlations may be put into more simple form for general use, as 
shown in Figure B in Appendix 3. This figure is based upon the idea 
that, although we cannot state the true correlation existing in the 
universe from the correlation shown in a given sample, we can estimate 
a minimum value for the true correlation, with a given chance of being 
wrong. Figure B has been calculated, by Fisher’s methods, to show 
such probable minimum correlations in the universe, with the prob- 
ability that the statements based on the figure will be wrong for 
1 sample out of 20, on the average. The results have been plotted 
for different sizes of sample and observed correlations. Thus if a 
random sample of 20 gives an observed correlation of 0.70, the figure 
shows at a glance that we can say that the true correlation is greater 
than 0.44, with the expectation that such statements will be wrong 
only once in twenty times, on the average. Similarly, for an observed 
correlation of 0.55 with a sample of 35 cases, reading from the line
for observed correlation = 0.55, and interpolating between n = 30 and
n = 40, gives 0.32, which means that we can say that the true
correlation is greater than 0.32, with the same degree of confidence. The
figure can be used in a similar manner for any other size of sample
up to 100, and any observed correlation.

⁹ See R. A. Fisher, Statistical Methods for Research Workers, seventh edition,
Oliver and Boyd, London and Edinburgh, 1938, pages 197 to 202, for a fuller
discussion of the use of t in judging the reliability of correlation coefficients.
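The two readings from Figure B can be approximated numerically. The sketch below assumes the common form of Fisher's transformation (z = atanh(r), standard error 1/√(n − 3)) and a one-sided normal deviate of 1.6449 for a 1-in-20 risk; whether Figure B was drawn in exactly this way is not stated in the text, so the agreement should be read as approximate. The names are hypothetical.

```python
import math

Z_ONE_SIDED_5_PERCENT = 1.6449   # normal deviate exceeded 1 time in 20 (one tail)

def minimum_true_correlation(r_observed, n):
    """Lower bound for the true correlation, wrong about 1 time in 20."""
    z_low = math.atanh(r_observed) - Z_ONE_SIDED_5_PERCENT / math.sqrt(n - 3)
    return math.tanh(z_low)

print(round(minimum_true_correlation(0.70, 20), 2))   # 0.44
print(round(minimum_true_correlation(0.55, 35), 2))   # 0.32
```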

Figure B deserves close study, for it tells a great deal about the 
sampling reliability, or, rather, unreliability, of correlation coefficients. 
The bottom line, for example, shows that, when samples are drawn 
from a universe where the true correlation is zero, 1 sample out of 
20 will show a correlation as high as ±0.60, on the average, with 
samples of 10 cases; as high as ± 0.49, with samples of 15 cases; and 
as high as ± 0.35, even with samples of 30 cases. Similarly, if the 
samples are drawn from a universe where the true correlation is 0.50, 
1 sample out of 20, on the average, will show a correlation as high 
as 0.81, with samples of 10; as high as 0.73, with samples of 20; 
and as high as 0.69, with samples of 30. Many other similar com- 
parisons can be made readily. For example, if the true correlation 
is 0.80 and samples of 10 cases are used, 5 per cent of the samples 
will show correlations as high as 0.93. These facts do not take into 
account the tendency of many students to examine a number of
possible independent variables and to select for more detailed study those
which show the highest correlation with the dependent factor. If
that is done, the possible minimum correlation in the universe,
corresponding to the correlation observed in the sample so selected, will
be even lower than would be estimated from Figure B.

Correlation indexes. The reliability of indexes of (curvilinear)
correlation, $\rho$, determined from very large samples, may be judged by
the use of the following equation:

$$\text{Standard error of index of correlation} = \sigma_\rho = \frac{1 - \rho^2}{\sqrt{n - m}} \qquad (72)$$

In using Table A to test the significance of such correlations for small 
samples by the t method, we must deduct 1 less than the number of 
constants necessary to represent the regression line mathematically 
[the value m of equation (26) minus 1] from the number of cases 
before using Table A. Thus if a correlation index computed for a 
cubic parabola fitted to 7 observations were to be judged, its reliability 
would be determined by using the column headed 4. Since 4 constants 
would be represented in the regression equation, 7 - (m - 1) = 4. If
the computation gives t = 2.8, Table A (or Figure A) indicates that
7 out of 100 such samples, on the average, would give a correlation
as high as or higher than the observed correlation, even if there were
no true correlation in the universe.

Empirical studies of the sampling variability of indexes of correla- 
tion indicate that they tend to be skewed in their distribution, just as 
do coefficients of correlation; therefore parallel special methods must 
be employed in judging their significance. Figures C, D, and E, on 
pages 507 to 509, prepared to apply to multiple correlation coefficients 
in the same way that Figure B applies to simple correlation coefficients, 
may be used tentatively to judge the reliability of indexes of
correlation, until more exact measures have been developed. Where m = 4,
Figure C (for $R_{1.234}$) may be used; where m = 6, Figure D (for
$R_{1.23456}$); and where m = 8, Figure E (for $R_{1.2345678}$). These figures,
also, are based upon methods developed by R. A. Fisher.

Multiple Correlation 

Coefficients of multiple correlation and net regression.
Correlation constants derived from multiple regression studies are even more
subject to chance variation than are those from simpler analyses. In a
random sample of 30 cases drawn from a known universe, for example,
the following values were obtained:

$R_{1.234} = 0.538$; $b_{12.34} = 0.583$; $b_{13.24} = 0.366$; $b_{14.23} = 0.949$

By drawing 15 more random samples of 30 cases each, 10 of 50 cases,
and 5 of 100 cases, values were secured for R and the b's as shown in
the following statement and in that on the next page.


Distribution of Values for Multiple Correlation Coefficients for
Repeated Samples Drawn from the Same Universe (True Value 0.563)

Range of values    30 observations   50 observations   100 observations

0.300-0.399               4                 1
0.400-0.499               3                 5
0.500-0.599               4                 1                  4
0.600-0.699               4                 3                  1
0.700-0.799               1




From these two tables we can see how the variation decreases as the
number of cases increases, and can also see what values the constants
from the several samples tend to center around, and so estimate the
approximate true value. But some definite idea of the range within
which this true value probably would lie could have been obtained by
computing the standard error of each constant by means of the
formulas:

Standard error for a coefficient of multiple correlation $R_{1.234 \ldots n}$:

$$\sigma_R = \frac{1 - R_{1.234 \ldots n}^2}{\sqrt{n - m}} \qquad (73)$$

Standard error for a coefficient of partial regression $b_{12.34 \ldots n}$:¹⁰

$$\sigma_{b_{12.34 \ldots n}} = \frac{\bar{S}_{1.234 \ldots n}}{\sigma_2 \sqrt{n \left(1 - R_{2.34 \ldots n}^2\right)}} \qquad (74)$$

Distribution of Values for Net Regression Coefficients for Repeated
Samples Drawn from the Same Universe

Range of values          30              50              100           True
                     observations    observations    observations     value

Values for $b_{12.34}$:
 -0.79 to -0.60           1
 -0.59 to -0.40           0
 -0.39 to -0.20           1
 -0.19 to -0.00           0               2
  0    to  0.19           2               1               1
  0.20 to  0.39           6               4               4           +0.320
  0.40 to  0.59           4               3
  0.60 to  0.79           3

Values for $b_{13.24}$:
 -0.19 to -0.00           2
  0    to  0.19           3
  0.20 to  0.39           5               6               2           +0.377
  0.40 to  0.59           2               2               2
  0.60 to  0.79           1               1               1
  0.80 to  0.99           2               1
  1.00 to  1.19           1

Values for $b_{14.23}$:
  0    to  0.19
  0.20 to  0.39                           1
  0.40 to  0.59           1               1
  0.60 to  0.79           8               2               2
  0.80 to  0.99           3               4               3           +0.824
  1.00 to  1.19           4               1
  1.20 to  1.39                           1
  1.40 to  1.59

¹⁰ The standard errors of the several net regression coefficients can be
determined at the same time that the regression coefficients are determined, and as part
of the same set of computations. See Appendix 1, "Methods of Computation."

Reliability of multiple correlation coefficient. The standard error
for the value of R, 0.538, given for our first sample works out to be
0.139. If we ignore the fact that the distribution of R, just as of r,
is not normal, we may interpret that roughly by saying that in 2 out
of 3 such samples, on the average, the true value of R in the universe
will be within the range $R \pm \sigma_R$, or between 0.40 and 0.68. As it
happens for this particular sample, the true value, 0.563, does lie within
this range. Ten of the 16 samples, or 63 per cent, gave values falling
within 0.139 of the true value, so the computed standard error is not
so misleading in this particular case. For still smaller samples, or for
higher correlations, the standard error computed by equation (73)
would be less reliable. For such cases we would use instead the
equation:


$$t = \frac{R_{1.234}\sqrt{n - m}}{\sqrt{1 - R_{1.234}^2}} \qquad (74.05)$$


This equation is used together with Table A to judge whether there 
is real evidence that the true correlation exceeds zero, just as equation 
(71.1) is used in the case of the correlation coefficient and index. In 
using Table A, m - 1 must be subtracted from the number of
observations. (This also applies in using Table A for coefficients of net
regression.) For more exact interpretations, Fisher’s transformation
method, previously referred to, may be utilized.¹¹
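The standard error of 0.139 and the range of 0.40 to 0.68 quoted for the first sample follow from equation (73), taken as (1 − R²)/√(n − m) with m = 4 constants. The sketch below also evaluates a t for the same sample, assuming that equation (74.05) has the form t = R√(n − m)/√(1 − R²) by analogy with equation (71.1); that form is a reconstruction, and the function names are hypothetical.

```python
import math

def se_multiple_correlation(R, n, m):
    """Equation (73): (1 - R^2) / sqrt(n - m)."""
    return (1.0 - R * R) / math.sqrt(n - m)

def t_for_multiple_correlation(R, n, m):
    """Assumed form of equation (74.05): R * sqrt(n - m) / sqrt(1 - R^2)."""
    return R * math.sqrt(n - m) / math.sqrt(1.0 - R * R)

R, n, m = 0.538, 30, 4
se_R = se_multiple_correlation(R, n, m)
print(round(se_R, 3))                             # 0.139
print(round(R - se_R, 2), round(R + se_R, 2))     # 0.4 0.68
print(round(t_for_multiple_correlation(R, n, m), 2))
```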

For small samples, the reliability of coefficients of multiple correla- 
tion varies not only with the correlation and the size of sample, but also 
with the number of independent variables. Fisher has developed 
an exact method for judging the probable significance of observed 
coefficients of multiple correlation.¹² Figures C, D, and E on pages 507
to 509 provide a simple method of applying his conclusions for
multiple correlation coefficients, in the same way that Figure B provides
for simple correlation coefficients. For problems involving 3, 5, and 7
independent factors, respectively, these figures show the approximate
minimum true correlation that probably exists in the universe with any
size of sample up to 100, and for any observed correlation, with the
probability that the statements based on the figure will be right for
19 samples out of 20, on the average. Thus if, with 30 observations,
a correlation of [...] should be obtained, we can say that
the true correlation (from Figure D) is at least 0.58. Similarly, if
for 50 observations a correlation of $R_{1.234} = 0.62$ were obtained,
Figure C gives 0.42 as the probable minimum correlation in the universe.
These conclusions, of course, apply only if the conditions of random
sampling are fulfilled. Problems with 2, 4, or 6 independent variables
may be considered by interpolating between the corresponding values
given for 1, 3, 5, or 7 independent variables.

¹¹ See pages 460 to 474 in Appendix 1, “Methods of Computation,” for the most
effective method of computing the standard errors of net regression coefficients,
according to equation (74).

¹² R. A. Fisher, The general sampling distribution of the multiple correlation
coefficient, Proceedings of the Royal Society, A, Vol. 121, pp. 654-673, 1928.

Considering the problem mentioned above, where a sample of 30
observations showed $R_{1.234} = 0.538$, Figure C gives a value of 0.16 as
the probable minimum correlation. From the single sample we could 
then say that the true correlation is probably at least 0.16 in the 
universe from which the sample was drawn, with one chance in 
twenty of being wrong. 

Figures C, D, and E show the possibilities of getting high correla- 
tions from a random sample, even when there is little or no correlation 
in the universe from which that sample was drawn. Thus for three 
independent variables, Figure C shows that, if samples of 15
observations are used, in 1 sample out of 20, $R_{1.234}$ will be as large as 0.69,
even if the correlation in the universe is zero, and as large as 0.78,
even if the true correlation in the universe is only 0.40. Similarly, if
there are 7 independent variables, Figure E shows that, if samples
of 20 cases are used, in 1 sample out of 20, on the average, $R_{1.2345678}$
will be as high as 0.79, with zero correlation in the universe; 0.85,
with 0.50 in the universe; and 0.91, with 0.70 in the universe. Even
with samples as large as 100 cases, $R_{1.2345678}$ for 5 per cent of the
samples will be as high as 0.37 for samples drawn from a universe with
zero correlation, and as high as 0.57 for samples drawn from a uni- 
verse with 0.40 as the true correlation. Figure D gives similar prob- 
abilities for 5 independent variables. Many other combinations of 
size of sample, true correlation in the universe, and observed correlation 
for 5 per cent of the samples are given in these figures. 

If the several independent variables in the multiple correlation 
had been selected by considering a large number of possible inde- 
pendent variables, and by retaining only those which showed the 
highest gross or net correlation with $X_1$, there is a much larger
possibility of the correlation in the sample exceeding the true correlation
in the universe by a wide margin. In fact, it is almost certain to be
erroneously high. If error calculations are to be used to judge the
sampling significance of the correlations or regressions observed, the
variables must be selected purely on logical or deductive grounds (as
discussed at length in Chapter 24), rather than on any such basis of
empirical selection of those which show the apparent closest relation.
It should always be remembered that, if the choice is purely empirical,
the next following period might readily reverse the order of apparent
importance of the several variables.

Reliability of net regression coefficients. Turning to the meaning
of the regression coefficients, we may illustrate the case with one
constant, $b_{12.34}$. The value given by the original sample was 0.538. For
that sample $\sigma_2 = 2.53$, $\bar{S}_{1.234} = 2.81$, and $R_{2.34} = 0.708$. If these
values are substituted in equation (74), the standard error works out
to be 0.287. The observed regression may therefore be stated to be
0.538 ± 0.287. This indicates that we can say that the true regression
probably lies between 0.251 and 0.825, with the expectation that such
statements will be right two times out of three, on the average; or
we can say that it lies between -0.036 and 1.112, with the expectation
that such statements will be wrong only one time out of twenty
(0.045). Actually, the true value in this case was 0.320, or within
the first range. It may be noted that 11 of the 16 samples showed
regression coefficients for $b_{12.34}$ within 0.287 of the true value, and
all but one fell within 0.574 of the true value. Again this illustrates
how the variability of constants which tend to be normally distributed 
may be estimated by appropriate error formulas, and hence how the 
reliance to be placed in conclusions from a given sample may be 
judged. 
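The standard error of 0.287 and the two ranges quoted above can be recomputed from the figures given for the sample. The sketch takes equation (74) in the form S/(σ₂√(n(1 − R²))), where R is the multiple correlation of the independent factor with the other independent factors; this form reproduces the quoted result. The function and argument names are hypothetical.

```python
import math

def se_net_regression(s_est, sigma_indep, r_indep_on_others, n):
    """Equation (74): S / (sigma_2 * sqrt(n * (1 - R_2.34^2)))."""
    return s_est / (sigma_indep * math.sqrt(n * (1.0 - r_indep_on_others ** 2)))

b = 0.538
se_b = se_net_regression(2.81, 2.53, 0.708, 30)
print(round(se_b, 3))                                   # 0.287
print(round(b - se_b, 3), round(b + se_b, 3))           # 0.251 0.825
print(round(b - 2 * se_b, 3), round(b + 2 * se_b, 3))   # -0.036 1.112
```

Raising the third argument toward 1 shows how quickly the standard error grows when one independent factor can be closely estimated from the others.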

From equation (74) it is evident that the reliability of a net regres- 
sion coefficient varies directly with the multiple correlation of the 
dependent factor with the other factors, but inversely with the multiple 
correlation of the particular independent factor with the other inde- 
pendents. The more closely a particular independent factor can be 
estimated from the other independent factors present, the less
accurately can the net relation of the dependent factor to it be determined.

The qualifications the use of this error formula throws around
regression results may be illustrated in a problem where the theory 
of sampling is fairly applicable, namely, the relation between the feed 
a herd of cows receives and the resulting milk production. Table 
75 shows these results for two different studies. 

This table illustrates two points: first, that the regression results 
are not very accurate even with a multiple correlation of 0.80; and, 
second, that the reliability of the regressions varies from variable to 
variable, being much greater in some cases than in others. It is obvious
that some of the regressions would have no statistical significance at all,
whereas others would indicate the probable relations within a fairly 
close range of accuracy. 





Thus for the percentage of lime, with the P.E. = 67.4 per cent of the 
regression, there is 1 chance out of 2 that the true net regression varies 
from that observed in this sample by two-thirds of the observed value,
and 1 chance out of 6 that the true net regression is of opposite 
sign from that observed. With the total digestible nutrients, on the 
other hand, with the probable error only 12 per cent of the observed 
value, there is but little chance that the observed value differs from the 
true regression by more than 30 per cent, and very little chance that it 
differs as much as 40 per cent. 
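The odds quoted for lime and for total digestible nutrients can be sketched from the normal curve. The sketch assumes the usual relation that the probable error is 0.6745 times the standard error, and it ignores any small-sample correction; the function names are hypothetical.

```python
import math

PE_FACTOR = 0.6745   # assumed: probable error = 0.6745 x standard error

def one_sided_tail(z):
    """Area of the normal curve beyond z standard errors, one tail."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def chance_of_opposite_sign(pe_percent):
    """Chance the true coefficient has the opposite sign, given the
    probable error as a per cent of the observed coefficient."""
    se_fraction = (pe_percent / 100.0) / PE_FACTOR
    return one_sided_tail(1.0 / se_fraction)

def chance_of_missing_by(pe_percent, miss_percent):
    """Two-sided chance that the true value differs from the observed
    value by more than miss_percent of the observed value."""
    se_fraction = (pe_percent / 100.0) / PE_FACTOR
    return 2.0 * one_sided_tail((miss_percent / 100.0) / se_fraction)

print(round(chance_of_missing_by(67.4, 67.4), 3))   # lime: 1 chance out of 2
print(round(chance_of_opposite_sign(67.4), 3))      # lime: about 1 chance out of 6
print(round(chance_of_missing_by(12.0, 30.0), 3))   # nutrients, 30 per cent
print(round(chance_of_missing_by(12.0, 40.0), 3))   # nutrients, 40 per cent
```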

If the regression equation is to be used solely as a basis for making 
new estimates of the value of the dependent factor to be expected 
for given values of the independent factors, then the accuracy of the 
several regression coefficients does not make such a great difference. 
Any deficiency in one may be compensated for by an excess in an- 
other. (This does not hold true, however, if estimates are made for 
extreme values of variables whose regressions are subject to large 
errors. See Chapter 19 on this point.) But if the major interest is not 
in the total estimate, but in the changes in the dependent factor with 
changes in each particular independent factor, then the reliability of 
each particular regression coefficient becomes of real importance. In 
the illustration cited, for example, it would not do to know merely 
that the milk production per cow varied both with protein content and 
with lime, if it was desired to know how much to allow for protein and 
how much for lime in compounding a ration. Instead, the probable 
errors indicate that the influence of protein (as represented in the 
“nutritive” ratio) has been fairly accurately measured, whereas the 
influence of lime has not been accurately measured at all. Not much 
confidence therefore can be placed in the conclusions as to this latter 
factor. 

In any correlation study where the results are based upon a sample 
of observations drawn at random from a known universe, and where 
any importance is to be attached to the values found for the several 
regression coefficients, it is essential that the standard errors of each 
of those coefficients be determined and considered. As is illustrated 
in the examples just discussed, a sample may have a very significant 
multiple correlation and yet yield regression coefficients for some 
variables which are almost entirely the result of chance fluctuation, 
and therefore of little or no significance. This may occur even with 
moderately large samples, such as the sample of 95 cases in the first 
example just considered. Computation, presentation, and discussion 



MULTIPLE CURVILINEAR CORRELATION 




of the standard errors of the regression coefficients are therefore vital 
parts of any such multiple correlation study.

TABLE 75

Probable Errors of Partial Regression Coefficients, in Per Cent
of the Value of the Coefficient *

Item                                          Wisconsin study    Minnesota study

Number of observations                        96                 77
Number of variables                           10                 8
Multiple correlation, adjusted for
  number of variables                         0.805 ± 0.039      0.862 ± 0.034

Probable Error of Regression Coefficients †

Independent variable                          Wisconsin study,   Minnesota study,
                                              per cent           per cent

Total digestible nutrients                    12.0               11.5
Nutritive ratio                               12.4                9.6
Per cent of protein "good"                    28.3                -
Per cent of lime                              67.4                -
Per cent summer feeding                       17.6                -
Per cent silage                               21.6               13.7
Fat test of milk                              10.6                3.7
Per cent fall freshening                      18.6               11.8
Value per cow                                 26.8                -
Age of cows                                    -                 17.9
Per cent grain in ration                       -                 20.2

* Mordecai Ezekiel, The application of the theory of error to multiple and curvilinear correla-
tion, Journal of the American Statistical Association, Vol. XXIV, No. 165A, March, 1929, Supple-
ment, p. 103.

† The coefficients are for the net regression of milk production on the factors stated. P.E.
= 0.6745 of the standard error.


Multiple curvilinear correlation. All the formulas for multiple 
regression constants cited up to this point have been derived for 

For illustrations of ways of presenting net regression coefficients, together
with their standard errors, see M. J. B. Ezekiel, P. E. McNall, and F. B. Morrison,
Practices responsible for variations in physical requirements and economic costs of
milk production on Wisconsin dairy farms, Agricultural Experiment Station of
Wisconsin Research Bulletin 79, August, 1927, pp. 21-23; and Kathryn H. Wylie
and Mordecai Ezekiel, The cost curve for steel production, Journal of Political
Economy, Vol. XLVIII, pp. 792-93, December, 1940.











linear correlation use. For curvilinear multiple correlation results, 
however, no measures of the probable error have yet been devised for 
the freehand process by logical and mathematical deduction. Ex- 
periments mentioned previously were initiated to provide at least 
some empirical measures of reliability. The results indicate that the 
index of multiple correlation must be corrected for the number of 
constants involved or assumed just as much as the coefficient of mul- 
tiple correlation, as has already been illustrated. 

The reliance to be placed on regression curves requires separate 
treatment. Where those curves are determined by fitting mathematical 
functions, the probable accuracy with which the true relation is ex- 
pressed by the mathematical curve may be judged by error formu- 
las which have been worked out mathematically by an extension of 
the same methods upon which those previously presented were based. 
For regression curves determined by the successive approximation 
process or by the graphic approximation process, no such mathemati- 
cal treatment is possible. Experimental study of the reliability of 
regression curves determined by successive approximations, however, 
has thrown some light on the reliability of such curves and made it 
possible to state the following general principles: 

First, the reliability of regression curves appears to vary inversely 
with the standard error of estimate for the entire sample. 

Second, the reliability of any point on a regression curve appears to 
vary directly with the square root of the number of observations on 
which that portion of the curve was based. 

Third, the reliability of any point on a regression curve, when 
stated as the difference between the value of the function at that 
point and the value of the function at the point corresponding to the 
mean of the independent variable, appears to vary inversely with the 
square root of the distance the selected point is from the mean of the 
independent variable, measured in units of the standard deviation of 
the independent variable. 

All these points apply equally to simple regression curves and net 
regression curves, computed while holding the influence of other factors 
constant. For net regression curves, one further point is involved, the 
extent to which one independent factor tends to vary with the other 
independent factors, which may be stated: Fourth, the reliability of 
points on a net (or partial) regression curve appears to vary inversely 
with the multiple curvilinear correlation of the particular independent 
factor with the other independent factors. 





The following formulas give a rough approximation to the standard 
error of net regression curves. These formulas express the four points 
just mentioned. In experimental work, these formulas, when com- 
puted from the results shown by individual samples have, on the 
average, successfully indicated the range within which the true re- 
gression curves lie 17 times out of 20 (using a range of twice the com- 
puted standard error). The proportion of very large errors, up to 
5 or more times the computed standard errors, has been larger than 
would be expected from a normal distribution of errors. These pre- 
liminary formulas may leave out some essential element in occasional 
cases, or the results of graphic freehand curve fitting may show errors 
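in exceptional samples out of proportion to those ordinarily made.


The formulas are:

$$\sigma_{f(X) - f(X_M)} = \sqrt{\frac{S_{y.f(x)}^{2}\, u\, x}{n_u\, \sigma_x^{2}}} \tag{74.1}$$

$$\sigma_{f_{12.34}(X_2) - f_{12.34}(X_{M_2})} = \sqrt{\frac{S_{1.f(2,3,4)}^{2}\, u\, x_2}{n_u\, \sigma_2^{2}\,(1 - P_{2.34}^{2})}} \tag{74.2}$$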


Since several new symbols are introduced in these two equations 
to cover the points which have been enumerated, they will first be 
defined. 

The symbols have exactly the same meaning for both equations,
except for the additional term $(1 - P_{2.34}^{2})$ in equation (74.2) for
regression curves determined by multiple correlation. The standard errors
of estimate, $S_{y.f(x)}$ and $S_{1.f(2,3,4)}$, have the same meaning as defined
in equations (21.1) to (22.2) and (64) or (65); $\sigma_x$ and $\sigma_2$ are the usual
standard deviations of the independent variable. The new terms have
the following meanings:

$f(X_M)$ means the reading from the regression curve $f(X)$ for the
point where $X$ is equal to $M_x$.

$f_{12.34}(X_{M_2})$ means the reading from the net regression curve $f_{12.34}(X_2)$
for the point where $X_2 = M_2$.

$n_u$ represents the number of observations falling within some
selected group interval of $X$ or $X_2$, with the point for which the
accuracy of the curve is to be determined, at the center. This interval

The derivation of these formulas is given in Mordecai Ezekiel, The sampling
variability of linear and curvilinear regressions, Annals of Mathematical Statistics,
September, 1930.





must be taken large enough to include the observations which were 
taken into account in determining the shape of that portion of the 
curve, yet small enough not to take in observations whose values did 
not enter into the determination of that part of the curve. 

$u$ designates the range over which $n_u$ is taken, stated in units of the
independent variable $X$ or $X_2$, and using the same units as those in
which the standard deviation, $\sigma_x$ or $\sigma_2$, is stated. Thus if the standard
deviation is in units of pounds or dollars, $u$ is also stated in pounds
or dollars.

The other term, $x$ or $x_2$, has the same meaning as used previously:
the deviation of the independent variable from the mean of that vari-
able, stated in the same terms as the standard deviation is stated in.
Thus for the point along the curve where the independent variable
has the value $X_a$, $x = X_a - M_x$. There is this difference from the
usual usage, however: in equations (74.1) and (74.2) $x$ and $x_2$ are
to be taken as positive numbers without regard to sign.
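The symbols just defined can be gathered into one small function. This is a minimal sketch of equations (74.1) and (74.2) as the definitions and the worked example that follows imply them; the function name and argument names are illustrative, and setting `p_squared` to zero is simply a convenient way of getting the simple-regression case (74.1) from the same code.

```python
import math


def curve_difference_se(s_est, u, x, n_u, sigma, p_squared=0.0):
    """Approximate standard error of f(X) - f(X_M) for a point on a regression curve.

    s_est     : standard error of estimate for the whole sample
    u         : width of the interval over which n_u is counted (units of X)
    x         : deviation of the point from the mean of X (sign is ignored)
    n_u       : number of observations falling within the interval
    sigma     : standard deviation of the independent variable
    p_squared : squared index of multiple correlation of this independent
                variable with the other independent variables; leave at 0
                for a simple regression curve
    """
    if n_u <= 0:
        raise ValueError("n_u must be positive")
    if not 0.0 <= p_squared < 1.0:
        raise ValueError("p_squared must lie in [0, 1)")
    variance = (s_est ** 2) * u * abs(x) / (n_u * sigma ** 2 * (1.0 - p_squared))
    return math.sqrt(variance)


base = curve_difference_se(s_est=80.7, u=0.70, x=0.57, n_u=6, sigma=0.73)
print(round(base, 1))
```

The four principles stated earlier can be read off directly: the error doubles when the standard error of estimate doubles, halves when four times as many observations fall in the interval, grows with the square root of the distance from the mean, and grows as the intercorrelation among the independent factors rises.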

The several steps in working out the reliability of a regression curve, 
and the meaning of the results, may be illustrated by applying these 
equations to one of the curves previously determined. 

The reliability of the regression curve worked out for cotton yields 
in Chapter 8 may be tested by equation (74.1). The curve obtained 
(Figure 23), on page 154, shows that with an average application of 
water, 1.97 feet, a yield of 328 pounds of cotton would probably be 
obtained, whereas with an application of 1.4 feet of water, a yield of 
195 pounds would probably be obtained. Apparently reducing the ap- 
plication of water 0.57 foot would reduce the yield of cotton by 133 
pounds. How accurate is this last conclusion? 

Picking out the values necessary to compute the standard error
according to equation (74.1), we have $S_{y.f(x)}$ = 80.7 pounds, $\sigma_x$ = 0.73
foot, and $x$ = 0.57 foot. Noting that the average yield in the groups
of 1 to 1.4 feet of water and 1.5 to 1.9 feet of water both had some
influence in determining the position of the curve at 1.4 feet, we may
let the interval for which we take the number of cases extend half way
into the upper group, or to 1.7, and an equal distance below 1.4, or to
1.1. The number of items for $n_u$, then, will include all the cases having
1.05 or more feet of water applied, and less than 1.75 feet applied,
a range of 0.70 foot. The number of cases falling in this range
(Table 31 on page 000) is found to be 6; so $n_u$ = 6 and $u$ = 0.70.
Substituting these several values in equation (74.1), we find the
arithmetic to be as follows:



Standard error of decrease of 133 pounds

$$= \sqrt{\frac{S_{y.f(x)}^{2}\, u\, x}{n_u\, \sigma_x^{2}}} = \sqrt{\frac{(80.7)^{2}(0.70)(0.57)}{(6)(0.73)^{2}}} = \sqrt{\frac{(6512)(0.70)(0.57)}{(6)(0.5329)}}$$

Standard error = 28.5

The difference of 133 pounds, therefore, has a standard error of 
29 pounds. The statement that the reduction of 0.57 foot in water 
applied reduces yields by 133 pounds, therefore, really means that 
the reduction is probably between 104 and 162 pounds, but there is at 
least 1 chance in 20 that it is as little as 77 pounds, or as much as 189 
pounds. That is, if we make the statement that the true value for all 
the farms in the universe lies between 77 and 189 pounds, we should be 
wrong in at least 1 out of 20 such statements, on the average. (Table A 
need not be considered in computing these chances, unless the total 
number of observations in the problem, n — m, is less than 30. Then 
$n - m$ should be used to find the column to determine the probabilities
from, rather than $n_u$. In this case, with $n - m$ = 11, the conclusions
are not quite so reliable as these statements indicate.)
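The arithmetic of this example is short enough to verify directly. The sketch below recomputes the standard error from the quantities given in the text and then forms the two ranges quoted; the multiplier 1.96 for the "1 chance in 20" range is the usual normal-curve value and is an assumption here, since the text does not state which multiplier was used.

```python
import math

# Quantities taken from the worked example in the text.
s_est = 80.7      # standard error of estimate, pounds
sigma_x = 0.73    # standard deviation of water applied, feet
x = 0.57          # distance of the point (1.4 feet) from the mean (1.97 feet)
u = 0.70          # width of the interval used to count observations, feet
n_u = 6           # observations falling in that interval
difference = 133  # estimated reduction in yield, pounds

std_error = math.sqrt(s_est ** 2 * u * x / (n_u * sigma_x ** 2))

# One standard error either side, with the error rounded to whole pounds.
inner = (difference - round(std_error), difference + round(std_error))

# Assumed normal multiplier for a range exceeded by chance 1 time in 20.
outer = (round(difference - 1.96 * std_error), round(difference + 1.96 * std_error))

print(f"standard error: {std_error:.1f} pounds")
print(f"probable range: {inner[0]} to {inner[1]} pounds")
print(f"1-in-20 range:  {outer[0]} to {outer[1]} pounds")
```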

It should be noted that the estimated range of error would not 
have been changed very greatly if a different interval had been used for 
$u$. Had the range from 1.25 to 1.55 been taken instead, $u$ would have
been 0.30 and $n_u$ would have been 5. If these values are substituted
in equation (74.1) instead of the ones used previously, the standard 
error works out as somewhat lower than before. The greater the total 
number of observations, the less effect a change in u will have on the 
computed error — it is only the small number of cases and very irregular 
distribution that causes as considerable a difference as in this par- 
ticular case — and even so, the indicated reliability is still of the same 
order. 
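The effect of the choice of interval can be seen by running the same formula with both sets of values. This is a sketch using only the figures quoted in the text (u = 0.70 with 6 cases, and u = 0.30 with 5 cases); the helper name is illustrative.

```python
import math


def point_error(s_est, u, x, n_u, sigma):
    """Standard error of the curve at one point, following equation (74.1)."""
    return math.sqrt(s_est ** 2 * u * x / (n_u * sigma ** 2))


wide = point_error(80.7, u=0.70, x=0.57, n_u=6, sigma=0.73)
narrow = point_error(80.7, u=0.30, x=0.57, n_u=5, sigma=0.73)

print(f"u = 0.70, n_u = 6: {wide:.1f} pounds")
print(f"u = 0.30, n_u = 5: {narrow:.1f} pounds")
```

With so few observations the narrower interval gives a noticeably smaller figure, but both results point to an error of a few tens of pounds on a difference of 133 pounds.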

To compute the range of error for the entire curve, we may pick 
out a number of selected points — say at each 0.2 foot of water — and 
work out the error for the reading at each of those points. The process 
may be shortened by noting that, in equation (74.1), the values of the
several terms remain unchanged for every point along the curve, with
the exception of $n_u$, $u$, and $x$; and that, if the same range is taken for
$u$ at each point, only the two other values are changed. Accordingly,
equation (74.1) may be restated as follows:

$$\sigma_{f(X) - f(X_M)} = \sqrt{k\,\frac{x}{n_u}}, \quad \text{where } k = \frac{S_{y.f(x)}^{2}\, u}{\sigma_x^{2}} \tag{74.11}$$

Since the value of $k$ is the same for every point along the curve, it can
be worked out once for all; then all that needs to be computed at each
point is $x/n_u$, the product $k(x/n_u)$, and the square root of the product.

The work of applying this process to the cotton-yield curve may be 
shown in tabular form. First the value of k must be worked out. If 
we continue to use the same range for u as before, 0.70 foot of water, 
the computation is : 


$$k = \frac{(80.7)^{2}(0.70)}{(0.73)^{2}} = 8{,}538$$


The next step is to enter $X$ for each selected point, compute the
value of $x$, and determine the value of $n_u$ and of $x/n_u$; then multiply
by $k$ and extract the square root. These several steps are shown in
Table 75A.


TABLE 75A

Computing the Standard Errors for Points Along a Regression Curve

  X       x      n_u     x/n_u      (Error)^2 = k(x/n_u)    Error = sqrt(k(x/n_u))

 1.2     0.77     6     0.12833         1095.71                  33.1
 1.4     0.57     6     0.09500          811.11                  28.5
 1.6     0.37     8     0.04625          384.89                  19.6
 1.8     0.17     6     0.02833          241.91                  15.5
 2.0     0.03     5     0.00600           51.23                   7.1
 2.2     0.23     4     0.05750          490.94                  22.1
 2.4     0.43     3     0.14333         1223.78                  35.0
 3.5     1.53     2     0.76500         6531.57                  80.8


The values of $n_u$ are determined just as in the single case before,
by taking all the cases falling within 0.35 foot above and 0.35 foot
below the value of $X$ selected. Thus for $X$ = 1.2, there are 6 cases
between 0.85 and under 1.55; whereas for $X$ = 2.4, there are but 3 cases
between 2.05 and below 2.75. The series of values in the $n_u$ column
add to considerably more than the total number of observations, since 
the range taken is such that there is considerable overlapping. This 
does not affect the final errors computed, however, since the unit 
selected for u tends to have no effect on the size of the computed error. 
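The tabular routine lends itself to a short loop. The sketch below recomputes the table from the quantities given in the text (standard error of estimate 80.7, u = 0.70, standard deviation 0.73, mean 1.97) and the n_u counts listed for each point. Direct arithmetic gives k of about 8,555, whereas the printed squared errors correspond to a k of about 8,538, so the computed figures may differ from the printed ones by a few tenths of a pound; to whole pounds they should agree.

```python
import math

S_EST = 80.7     # standard error of estimate, pounds
U = 0.70         # interval width used at every point, feet
SIGMA_X = 0.73   # standard deviation of water applied, feet
MEAN_X = 1.97    # mean water application, feet

# (X, n_u) pairs: selected points and the number of cases within 0.35 foot of each.
POINTS = [(1.2, 6), (1.4, 6), (1.6, 8), (1.8, 6), (2.0, 5), (2.2, 4), (2.4, 3), (3.5, 2)]

k = S_EST ** 2 * U / SIGMA_X ** 2  # constant term of equation (74.11)

rows = []
for big_x, n_u in POINTS:
    x = abs(big_x - MEAN_X)          # deviation from the mean, sign ignored
    ratio = x / n_u
    error_sq = k * ratio
    rows.append((big_x, x, n_u, ratio, error_sq, math.sqrt(error_sq)))

print(f"k = {k:.1f}")
print(f"{'X':>4} {'x':>6} {'n_u':>4} {'x/n_u':>9} {'Error^2':>9} {'Error':>7}")
for big_x, x, n_u, ratio, error_sq, error in rows:
    print(f"{big_x:4.1f} {x:6.2f} {n_u:4d} {ratio:9.5f} {error_sq:9.2f} {error:7.1f}")

whole_pound_errors = [round(row[-1]) for row in rows]
```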








The importance of the errors shown in the last column of Table 
75A may be judged by comparing them to the values to which they 
apply — the difference between the estimated cotton yield for the several 
values of X and the yield estimated for the mean value of X. Table 
75B shows this comparison. 

TABLE 75B

Deviation of Points on a Regression Curve from the Value for the
Mean, and the Standard Errors

   X      f(X)    f(X) - f(X_M)    Standard errors *    2.25 (standard error) †

  1.2     145        -183              ±33                    ±75
  1.4     195        -133              ±29                    ±65
  1.6     248        - 80              ±20                    ±44
  1.8     293        - 35              ±16                    ±35
  1.97    328           0
  2.0     335           7              ± 7                    ±16
  2.2     376          48              ±22                    ±50
  2.4     414          86              ±35                    ±79
  3.5     543         215              ±81                    ±182

* From Table 75A.

† For a case where n - m = 11, range from the true value within which 95 per cent of the
sample values will fall, on the average.


The standard errors will have to be interpreted with respect to the
total number of observations, adjusted by $m$. For this problem, $m$ = 3,
so Table A should be entered with 12. Interpolating, we find that for
such samples a departure of more than one standard error from the true
value is likely to occur 34 times out of 100, and a departure of more
than twice the standard error is likely to occur about 8 times out of 100
(as compared to less than 5 times for a very large sample). To
estimate the extent of the true differences in yield lying beyond the
observed differences which will be exceeded only in one sample out
of 20, on the average, it is necessary to add about ± 2.25 times the
standard error to the observed differences. This value is accordingly
entered in the final column of Table 75B.

In view of the rather rough approximation to the true standard error of the
curves given by these formulas, this use of Table A may be a refinement which is
hardly justified. As indicated before, 2 or 3 samples out of 20, on the average,
may show departures exceeding twice the standard error, as calculated by this
method.
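The comparison made in Table 75B can be sketched as follows. The curve readings and whole-pound standard errors are those printed in the table; the multiplier 2.25 is the one given in the text. The final column of the sketch, which flags whether each difference from the mean exceeds its wider range, is an added illustration and not part of the original table.

```python
F_AT_MEAN = 328      # curve reading at the mean water application, pounds
MULTIPLIER = 2.25    # multiplier given in the text for n - m = 11

# X: (curve reading f(X), standard error from Table 75A rounded to pounds)
TABLE = {
    1.2: (145, 33),
    1.4: (195, 29),
    1.6: (248, 20),
    1.8: (293, 16),
    2.0: (335, 7),
    2.2: (376, 22),
    2.4: (414, 35),
    3.5: (543, 81),
}

results = []
for big_x, (reading, std_error) in TABLE.items():
    deviation = reading - F_AT_MEAN
    wide_range = MULTIPLIER * std_error
    # True when the observed difference is larger than the 1-in-20 range.
    exceeds = abs(deviation) > wide_range
    results.append((big_x, deviation, std_error, wide_range, exceeds))

for big_x, deviation, std_error, wide_range, exceeds in results:
    print(f"X={big_x:3.1f}  difference={deviation:5d}  s.e.=±{std_error:2d}  "
          f"wide=±{wide_range:6.1f}  beyond wide range: {exceeds}")
```

Read this way, the differences at 2.0 and 2.2 feet are too small, relative to their errors, to be distinguished from zero, whereas those at the low end of the curve are well beyond the wider range.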





Just as with a regression line, the value of Y corresponding to the 
mean value of X is not exactly certain. No special sampling study 
has been made of regression curves in this connection. It would hardly 
be correct to apply equation (70) directly to this, as the central value
for a freehand regression curve is not as definitely determined as for a 
straight regression line. Accordingly, the errors may be interpreted 
only with respect to differences from the mean value, rather than with 
respect to actual values. 

The regression curve with its computed standard error is shown in 
Fig. 60, together with the wider range to reduce the probability of error 
to 0.05. This chart indicates the interpretation which may be given 
to the computed errors. The inner borders indicate the range within 
which the relation probably lies, with the regression curve from 1 
sample out of 3, on the average, differing from the true curve by more 
than the range shown; whereas the outer borders indicate the range 
within which the true curve probably falls, with at least 1 sample out of 
20, on the average, giving a regression curve which differs from the 
true curve by more than the range shown. 

It is now quite evident why the table showing the relation according 
to the regression curve, as set forth in Chapter 8, page 154, was not 
carried beyond 2.5 feet of water. In fact, it might be just as well not 
to carry it beyond 2.25 feet, to judge from Figure 60, for at 2.5 feet 
there is at least 1 chance out of 20 that the true difference in yield 
above the yield for the average water application differs from that 
estimated by nearly the difference shown in the estimate. The wide 
range of possible error for the true position of this curve reflects both 
the small number of observations upon which it is based and the 
relatively low correlation shown by those observations. Even so, the 
computed range of error indicates what degree of reliance can be 
placed in the findings under these limiting conditions, and so makes 
the results of the analysis of more value than if we had no knowledge 
of their probable stability. 

It is evident from the illustration that certain portions of a regres- 
sion curve may be much less accurately determined than certain other 
portions. It is not merely the total number of observations in the 
sample, but the way they are scattered or bunched along the curve 
which is fitted, which affects the reliability of the various portions of 
the regression curve. 

The process of working out the standard error for a net or partial
regression curve is exactly the same as that just illustrated, except
that equation (74.2) is used instead of (74.1). The computation may
be broken into two steps just as illustrated for simple correlation, as
follows:

$$\sigma_{f_{12.34}(X_2) - f_{12.34}(X_{M_2})} = \sqrt{k'\,\frac{x_2}{n_u}}, \quad \text{where } k' = \frac{S_{1.f(2,3,4)}^{2}\, u}{\sigma_2^{2}\,(1 - P_{2.34}^{2})} \tag{74.21}$$


The multiple curvilinear intercorrelation of each independent variable
with the remaining independent variables can be determined fairly
rapidly by the use of the short-cut graphic method. Thus for the
second problem used in Chapter 16, computation of the standard error
zone for the several independent variables involves determination of
the index of multiple correlation, $P$, for each of the three supplementary
regression relations, as follows:

$$P_{2.34} \text{ from } X_2 = f_{23.4}(X_3) + f_{24.3}(X_4) \tag{A}$$

$$P_{3.24} \text{ from } X_3 = f_{32.4}(X_2) + f_{34.2}(X_4) \tag{B}$$

$$P_{4.23} \text{ from } X_4 = f_{42.3}(X_2) + f_{43.2}(X_3) \tag{C}$$

Fig. 60. Curvilinear regression of cotton yield on irrigation water applied, and
range within which the true relation probably lies.



In these three regression equations, to prevent confusion the same
notation has been used for the subscripts to the "f's" in designating the
several net regression curves as is used ordinarily in distinguishing
the several net regression coefficients.

After the three sets of regression curves, (A), (B), and (C), are
determined by the short-cut process, the final residuals for each may
be read off from the final charts, the values of $\sigma_2$ and $\sigma_{z_2}$, $\sigma_3$ and $\sigma_{z_3}$,
and $\sigma_4$ and $\sigma_{z_4}$ determined, and the values of $P_{2.34}$, $P_{3.24}$, and $P_{4.23}$
computed from these by equation (66.3), just as was illustrated in Chapter
16. With these three values, and the standard deviations of all the
variables, we can then compute the standard error zone for each net
regression curve by equation (74.21), carrying through for each variable
in turn computations similar to those just indicated.

TABLE 75C

Computing the Standard Errors for Points along the Net Regression
Curve $f_{12.34}(X_2)$

  X_2      x_2     n_u    x_2/n_u    (Error)^2 = k'(x_2/n_u)    Error = sqrt(k'(x_2/n_u))

  25      37.97     3      12.66           24.41                      4.9
  35      27.97     4       6.99           13.48                      3.7
  45      17.97     3       5.99           11.56                      3.4
  55       7.97     2       3.98            7.67                      2.8
  65       2.03     5        .41             .79                      0.9
  75      12.03     7       1.72            3.32                      1.8
  85      22.03     7       3.15            6.07                      2.5
  95      32.03     4       8.01           15.44                      3.9


Carrying out these curvilinear correlation analyses, we obtain values
of $P_{2.34}^{2}$ = 0.6986 and $P_{3.24}^{2}$ = 0.4809. The standard deviations are
also computed, giving $\sigma_2^{2}$ = 513.76 and $\sigma_3^{2}$ = 44.43. The $M_2$ = 62.97,
and $M_3$ = 69.87. The values of $S_{1.f(2,3,4)}^{2}$ and $\sigma_1^{2}$, page 294, are 14.93 and
51.70, respectively.

Using equation (74.21), we next calculate the value of $k'$. This
involves deciding on the value of $u$ to use. For $X_2$, where the observa-
tions run from 18.3 to 88.3, an interval of 20 seems appropriate, beginning
at 16. Accordingly, $k'$ becomes

$$k' = \frac{S_{1.f(2,3,4)}^{2}\, u}{\sigma_2^{2}\,(1 - P_{2.34}^{2})} = \frac{(14.93)(20)}{(513.76)(1 - 0.6986)} = 1.928$$




Table 75C shows the work set up in the same form as in Table
75A. The values in the $n_u$ column are taken from Table 69A of
Chapter 16, by calculating the frequency of $X_2$ in each 20-unit range
around the $X_2$ values stated. Thus, for the group with $X_2$ = 35,
there are four observations in the range from 25 to 45, and 4 is therefore
the $n_u$ value for this group. The next group, $X_2$ = 45, includes
the range 35 to 55, with three observations. The fact that some observations
are counted twice makes no difference, as that is allowed for
by the inclusion of the $u$ value in equations (74.2) and (74.21).
Similarly, for the regression $f_{13.24}(X_3)$, $k'$ becomes

$$k' = \frac{S_{1.f(2,3,4)}^{2}\, u}{\sigma_3^{2}\,(1 - P_{3.24}^{2})} = \frac{(14.93)(10)}{(44.43)(1 - 0.4809)} = 6.473$$

Here, with $X_3$ varying from 59 to 86, units of 10 are used for $u$. The
computation of the errors is as follows:

TABLE 75D

Computing the Standard Errors for Points along the Net Regression
Curve $f_{13.24}(X_3)$

  X_3      x_3     n_u    x_3/n_u    (Error)^2 = k'(x_3/n_u)    Error = sqrt(k'(x_3/n_u))

  60       9.87     4       2.47           15.99                      4.0
  65       4.87     3       1.62           10.49                      3.2
  70       0.13    12        .01             .07                      0.3
  75       5.13    12        .43            2.78                      1.7
  80      10.13     1      10.13           65.57                      8.1
  85      15.13     1      15.13           97.94                      9.9

The values for $n_u$ are likewise obtained from Table 69A, taking the
frequencies of $X_3$ in each 10-unit range around the $X_3$ values selected.
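Both net-regression tables follow from the same two-step routine, which the sketch below carries out. It takes the squared standard error of estimate (14.93), the variances of the two independent variables, their means, and the squared intercorrelation indexes from the text, together with the n_u counts listed in Tables 75C and 75D. The function names are illustrative. To the one decimal shown, the computed errors should reproduce the final column of each table.

```python
import math

S_SQUARED = 14.93  # squared standard error of estimate for the cost regression


def k_prime(s_squared, u, variance_x, p_squared):
    """Constant term of equation (74.21) for one independent variable."""
    return s_squared * u / (variance_x * (1.0 - p_squared))


def error_column(points, mean_x, k):
    """Standard error at each (X, n_u) point: sqrt(k * |X - mean| / n_u)."""
    return [math.sqrt(k * abs(big_x - mean_x) / n_u) for big_x, n_u in points]


# X2, per cent of capacity operated: u = 20, variance 513.76, P^2 = 0.6986.
k2 = k_prime(S_SQUARED, 20, 513.76, 0.6986)
x2_points = [(25, 3), (35, 4), (45, 3), (55, 2), (65, 5), (75, 7), (85, 7), (95, 4)]
x2_errors = [round(e, 1) for e in error_column(x2_points, 62.97, k2)]

# X3, wage rates: u = 10, variance 44.43, P^2 = 0.4809.
k3 = k_prime(S_SQUARED, 10, 44.43, 0.4809)
x3_points = [(60, 4), (65, 3), (70, 12), (75, 12), (80, 1), (85, 1)]
x3_errors = [round(e, 1) for e in error_column(x3_points, 69.87, k3)]

print(f"k' for X2 = {k2:.3f}; errors: {x2_errors}")
print(f"k' for X3 = {k3:.3f}; errors: {x3_errors}")
```

The single observations at 80 and 85 cents are what drive the very large errors at the upper end of the wage curve.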

The standard errors as computed in Tables 75C and 75D could
next be compared with the values to which they apply, by working
out the departures of the net regression curves from their means just
as was shown in Table 75B. That step will be omitted here. Instead,
the errors are plotted graphically as ± departures from their respective
net regression curves, as shown in Figure 61. In this case, with $n$ = 18,
and $m$ = 8 (note page 293 of Chapter 16), we enter Table A (or
Figure A) with 11 to find the significance of the departure. That




gives us approximately 0.34. Accordingly, we conclude that in one
sample out of three, on the average, the net regressions would miss the 
true regressions in the universe by larger amounts than those indicated 
in Figure 61 for this particular problem. 
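Table A is not reproduced in this section, so the figure of 0.34 cannot be checked against it directly. As a rough cross-check only, the sketch below assumes that the table behaves like Student's t distribution entered with the same number, and finds the chance of a departure larger than a given multiple of the standard error by numerical integration. Both that assumption and the function names are illustrative; Table A itself may differ somewhat.

```python
import math


def t_density(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1.0 + t * t / df) ** (-(df + 1) / 2.0)


def chance_of_larger_departure(multiple, df, steps=2000):
    """Two-sided chance that |t| exceeds `multiple`, by Simpson's rule on [0, multiple]."""
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = multiple / steps
    total = t_density(0.0, df) + t_density(multiple, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_half = total * h / 3.0
    return 1.0 - 2.0 * central_half


one_se_11 = chance_of_larger_departure(1.0, 11)
print(f"chance of missing by more than one standard error, 11: {one_se_11:.2f}")
```

Under this assumption the chance comes out close to one in three, which is consistent with the statement in the text.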

Fig. 61. Curvilinear net regressions of steel costs per ton on operation rate and
wage rates, and standard error range for the net regressions. (Horizontal axis of
the wage panel: $X_3$, wages in cents per hour.)

When we compare the zones of standard error in Figure 61 with the
distribution of the original observations as shown in Figures 51 and 52
of Chapter 16, we see that the values of the independent variable for
which the regression is fairly accurately determined are the values
where the bulk of the observations fall. Thus, in the case of $X_2$,
capacity operated, the observations are thickly clustered from 65 to 90, 
and thinly spread over the rest of the range. In consequence, the zone 
of standard error is narrowest in this region where relatively more 
observations were available to determine the slope of the curve. 
Similarly, for $X_3$, wage rates, the bulk of the observations fell between
70 and 75, with a thin scatter below 70 and with only one observation
above 80. This distribution also is faithfully reflected in the standard
errors, with a very wide error zone above 75, indicating that little
is known of the slope of the regression curve in that range. The error
equations (74.2) and (74.21) have this especial property of indicating 



REGRESSION CURVES FITTED MATHEMATICALLY


the accuracy of determination of the various portions of each regres- 
sion curve, in view of the adequacy with which that portion of the 
curve is represented by the distribution of the observations in the 
sample. 

It will also be noted, in Figure 61, that if a horizontal line were
passed through the mean of each curve (the point of zero error) it
would fall entirely within the standard error zone for most of its length
for $f_{13.24}(X_3)$, but would fall largely outside the zone of error for
$f_{12.34}(X_2)$. That means that there is no certainty that there was
any net relation between costs ($X_1$) and wages ($X_3$), whereas there
is definite indication of a net relation between costs ($X_1$) and capacity
operated ($X_2$). This result is secured even though the correlation of
$X_2$ with $X_3$ and $X_4$ is materially higher than the correlation of $X_3$
with $X_2$ and $X_4$. The net relation found between $X_1$ and $X_3$ could
readily have occurred by chance; there is much less possibility that
the observed net relation between $X_1$ and $X_2$ could have been a chance
result of the particular rates of operation which occurred during the
years under study. The errors for $f_{14.23}(X_4)$ are not calculated, since
there seems little real meaning in a standard error for a trend regression.

In a problem drawn from a time series, the meaning of error
computations is less certain than in samples drawn at random from a
true universe. Even in time series, however, the error computations
may serve as some indication of the closeness within which the available
data can locate the underlying relationships. (See also the next
chapter for the significance of error formulas in time series.)

In all problems based on random sampling, where any generaliza- 
tions as to the relations in the universe are to be based on the shape 
or slope of the final curves obtained by the graphic method, the error 
zones should be computed and should be given due consideration when
the data are presented. This is just as important for curvilinear regres- 
sions as is the use of standard error values for linear regression co- 
efficients. 

Regression curves fitted mathematically. Where regression curves
are obtained by fitting definite mathematical equations to the data,
the standard error of the curve may be judged by the same methods
previously presented for determining the probable errors of net regression
coefficients. Thus if a parabola of the formula

$$X_1 = a + bX_2 + b'X_2^{2}$$

is determined, the standard errors of $b$ and $b'$ may be determined
by equation (74), treating $X_2$ and $X_2^{2}$ as two independent variables.
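The idea of treating $X_2$ and $X_2^{2}$ as two independent variables can be sketched with an ordinary least-squares fit. The code below is not equation (74) itself, which is not reproduced in this section; it uses the standard textbook formula for the standard errors of least-squares coefficients as a stand-in. The data values are hypothetical and chosen only to make the example run.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_parabola(xs, ys):
    """Fit X1 = a + b*X2 + b'*X2^2 by least squares.

    Returns ([a, b, b'], [se_a, se_b, se_b']), with X2 and X2^2 treated as
    two separate independent variables.
    """
    rows = [[1.0, x, x * x] for x in xs]
    n, m = len(rows), 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(m)]
    inv = invert(xtx)
    coef = [sum(inv[i][j] * xty[j] for j in range(m)) for i in range(m)]
    residuals = [y - sum(c * v for c, v in zip(coef, r)) for r, y in zip(rows, ys)]
    s_squared = sum(e * e for e in residuals) / (n - m)  # adjusted for 3 constants
    std_errors = [math.sqrt(max(s_squared * inv[i][i], 0.0)) for i in range(m)]
    return coef, std_errors


# Hypothetical observations, for illustration only.
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x1 = [4.6, 6.1, 6.4, 6.2, 4.3, 2.1, -1.6, -5.8]

coef, std_errors = fit_parabola(x2, x1)
for name, c, se in zip(("a", "b", "b'"), coef, std_errors):
    print(f"{name:>2} = {c:8.4f}   standard error = {se:.4f}")
```

A coefficient that is small relative to its standard error, particularly $b'$, would suggest that the curvature found in the sample might be no more than a chance result.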





The range within which the true curve probably lies may then be 
worked out just as has been illustrated for a linear regression. Simi- 
larly, if net regression curves are determined by fitting several mathe- 
matical equations simultaneously (as presented in detail in Chapter 
22), an extension of this same method may be used to judge the
reliability of each of the net regression curves so obtained.

Summary. Coefficients of correlation and of regression, indexes 
of correlation, and regression curves, when determined from a limited 
sample, may depart more or less widely from the true value for the 
universe from which that sample was drawn. This chapter presents 
methods by which the possible extent of that variability may be judged. 
These methods represent an extension of the methods presented in 
Chapter 2 for judging the reliability of averages. 

The methods of this chapter apply only when the correlations are 
determined from samples so selected as to comply with all the assump- 
tions of random sampling. Where the samples are selected by other 
methods, the results may be of greater or of less reliability than if 
random sampling had been employed. Furthermore, in many types 
of problems, such as in time series, the observations can hardly be 
regarded as samples drawn from a universe. In such cases, statistical 
measures of reliability have a less precise meaning, but may still be 
valuable as a caution on the use of the results. 

See Henry Schultz, The standard error of a forecast from a curve, Journal of
the American Statistical Association, pp. 139-185, June, 1930.



CHAPTER 19 


THE RELIABILITY OF AN INDIVIDUAL FORECAST AND OF 
TIME-SERIES ANALYSES 

The preceding chapter has indicated the kind of variability from
sample to sample that may be expected in determining statistical con- 
stants, such as regression and correlation coefficients, and in determin- 
ing regression lines and curves. It has provided means of estimating, 
from the values obtained from a single sample, various indications of 
how far and how frequently the results from successive samples of 
the same size are likely to vary from the true values in the universe 
from which the samples are drawn. 

Reliability of an Individual Forecast 

The practical statistician frequently has to deal with a quite dif- 
ferent problem. Having taken a given sample, and having determined 
from that sample how the selected dependent variable is related to 
one or more independent variables, he then has the problem of drawing
new observations of the same independent variable(s) from the
same universe, and of estimating from those new values the most
probable value of the dependent variable for the new cases. In series
involving time relations, this becomes the problem of forecasting. In
the corn-yield problem of Chapter 14, for example, it is possible to
forecast the ultimate yield for the season as soon as the rainfall and
temperature during the growing season are recorded. In problems
involving crop production and price, it is possible to forecast the average
price for the season as soon as the average crop is known. In these
two cases, involving successive observations in time, theories of simple
sampling do not apply rigorously, since the observations are not drawn
fully at random. (See discussion of the time-series problem later in
this chapter.)

In other problems, however, sampling theory may be fully applicable.
In a sample of children drawn at random from the school
population of a given city, certain relations may be determined be- 
tween their age and height and their weight. From these relations,
how closely can we expect to estimate the weight of a new child, 
selected at random from the same population? In problems such as 
this, we are concerned with the possible difference between the estimated
value, $X_1'$, and the actual value, $X_1$, for new observations drawn
from the same universe as the sample. Heretofore we have calculated
standard errors for the regression coefficient and line and standard
errors of estimate for the observed errors in estimating $Y$ or $X_1$, in
the sample under study. Also we have used methods of adjusting 
the standard error of estimate to obtain the most probable variation 
from the true regression line in the parent universe. The present prob- 
lem, however, involves the accuracy of estimates made from the line 
or curve obtained from the sample, in the light of the possible sampling 
errors of that line, as compared to the true line, plus the possible range 
of errors of the estimates around the true line. What we need, therefore,
is a means of combining the standard error of the regression line,
$\sigma_{y'}$, with the standard error of estimate, $\bar{S}_{1.23 \ldots n}$.

Simple correlation. For a simple two-variable correlation, the 
square of the standard error of a single estimate is given by the 
equation¹

$$\sigma^2_{y'-y} = \sigma^2_{y'} + \bar{S}^2_{y.x} \qquad (75)$$

Applying this equation to the illustration used previously, on page
316, we can tabulate the calculation of various values as follows:





Calculation of $\sigma_{y'-y}$

Selected     Departures
values       from                                                $\sigma^2_{y'-y}$ =
of X         mean, x      $\sigma^2_{y'}$    $\bar{S}^2_{y.x}$    (3) + (4)            $\sigma_{y'-y}$
(1)          (2)          (3)                (4)                  (5)

0.97         -1.00        14.0650            68.62                 82.6850              9.09
1.47         -0.50         7.1793            68.62                 75.7993              8.71
1.97          0            4.8841            68.62                 73.5041              8.57
2.47          0.50         7.1793            68.62                 75.7993              8.71
2.97          1.00        14.0650            68.62                 82.6850              9.04
3.47          1.50        25.541             68.62                 94.1610              9.70
3.97          2.00        41.6077            68.62                110.1977             10.50


The last column gives the standard errors of estimate for values of
Y estimated from new values of X drawn from the same universe.
It is apparent from these values that standard errors for individual 


¹ The derivation of this equation is given in Note 14, Appendix 2.





forecasts near the mean of X are but little larger than $\bar{S}_{y.x}$. Thus the
standard error for the forecast of 22.3 for $Y'$ when X = 1.47 is only
$\sigma_{y'-y}$ = 8.71, as compared with $\bar{S}_{y.x}$ = 8.28. The further the observed
value of X departs from the mean, the larger the uncertainty of the
individual forecast. Thus when X = 3.97, $\sigma_{y'-y}$ = 10.50. We can
state this uncertainty of the estimate more simply by expressing the
relation as follows:

When X = 1.47, $\bar{Y}$ = 22.3 ± 8.71
When X = 3.97, $\bar{Y}$ = 64.0 ± 10.50

Here we have introduced a new symbol, $\bar{Y}$, to designate the probable
range within which the true value will lie, for two estimates out of three
on the average.
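The arithmetic of the tabulation can be checked with a short sketch. It assumes only what the text states: the square of the standard error of a single forecast is the sum of the squared standard error of the regression line at the chosen X and the squared standard error of estimate. The function and variable names are illustrative and are not notation from the text; the figures are those of columns (3) and (4) of the table.

```python
import math


def forecast_standard_error(var_line: float, var_estimate: float) -> float:
    """Standard error of a single forecast.

    var_line     -- squared standard error of the regression line at the chosen X
    var_estimate -- squared (adjusted) standard error of estimate
    """
    return math.sqrt(var_line + var_estimate)


# Column (3) of the table, keyed by the size of the departure of X from its mean.
# The table shows the same value for equal departures above and below the mean.
VAR_LINE_BY_DEPARTURE = {0.0: 4.8841, 0.5: 7.1793, 1.0: 14.0650, 1.5: 25.541, 2.0: 41.6077}
VAR_ESTIMATE = 68.62  # column (4), the same for every row

for departure, var_line in VAR_LINE_BY_DEPARTURE.items():
    sigma = forecast_standard_error(var_line, VAR_ESTIMATE)
    print(f"departure {departure:.2f}: standard error of forecast = {sigma:.2f}")
```

The loop makes the point of the paragraph visible: the standard error is smallest at the mean of X and grows as the departure grows.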

These standard errors of individual forecasts are interpreted in 
the same way as any other standard error, as indicating (for various 
selected multiples of the standard error) the proportion of a succession 
of such forecasts which will show departures from the true values of 
stated sizes. Thus, in the problem illustrated on pages 147 and 151, 
when yields are estimated for new plots with 3.97 feet of water ap- 
plied, two out of three new observations, on the average, should show 
yields falling within 10.50 ten-pound units of the estimated yield.
Table A, in Chapter 2, should be used in interpreting this standard 
error in exactly the same way it has been used before. 

If we wish to know the ranges within which the actual value will
agree with the forecasted values except for a specified proportion of
the estimates, say 5 out of 100, we can determine those ranges by
computing each one of them from the formula

$$\bar{Y} = Y' \pm t\,\sigma_{y'-y} \qquad (76)$$

The value to be used for t is obtained from Table A, page 23, or
Figure A, in Appendix 3, by selecting the value which gives 0.05 as the
proportion of cases. (In Table A, t corresponds to the values in the left-
hand column; in Figure A, to the abscissas, shown across the bottom.)
For example, assume that we were estimating the probable yield of
cotton for a new plot where 2.97 acre-feet of water had been applied.
The estimate, $Y'$, is 51.9 ten-pound units. What is the true yield likely
to be? Equation (76) then becomes

$$\bar{Y} = 51.9 \pm (9.04)\,t$$

The regression line (page 148) was determined from a sample of
14 observations. The straight line involves two constants, so the n for
Table A, Chapter 2, = 14 - 2 + 1 = 13. Interpolating between the
lines for n = 12 and n = 16 in Figure A (page 505) on the ordinate
corresponding to 0.05, we find the corresponding abscissa gives t as 2.16.
Upon substitution, the equation becomes

$$\bar{Y} = 51.9 \pm (9.04)(2.16)$$

$$\bar{Y} = 51.9 \pm 19.5$$


Accordingly, we estimate that the true yield will lie between 32.4 and 
71.4 ten-pound units, or between 324 and 714 pounds, knowing that we 
are likely to be wrong only in one out of twenty such estimates, on 
the average. 
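The same range can be reproduced in a few lines. This is a minimal sketch that takes the estimate, the standard error, and the value of t exactly as quoted in the text; the function name is an assumption made for illustration, and t is entered by hand rather than read from Table A or Figure A.

```python
def forecast_range(estimate: float, standard_error: float, t: float) -> tuple:
    """Lower and upper limits of the range: estimate plus or minus t standard errors."""
    margin = t * standard_error
    return estimate - margin, estimate + margin


# Figures quoted in the text: estimate 51.9, standard error 9.04, t = 2.16.
low, high = forecast_range(51.9, 9.04, 2.16)
print(f"{low:.1f} to {high:.1f} ten-pound units")
```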

Multiple correlation. The equation for the standard error of an 
individual forecast made from a multiple regression equation is similar 
to that given for simple correlation, with the addition of expressions 
for the additional variables, as follows: 


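$$\sigma^2_{x'_{1.234}-x_1} = \bar{S}^2_{1.234}\left[1 + \frac{1}{n} + c_{22}x_2^2 + c_{33}x_3^2 + c_{44}x_4^2 + 2c_{23}x_2x_3 + 2c_{24}x_2x_4 + 2c_{34}x_3x_4\right] \qquad (77)$$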


In this equation $x_2$, $x_3$, and $x_4$ are the values of the independent
variables for which the forecast is made, stated as departures from
the respective means $M_2$, $M_3$, and $M_4$, as calculated in the original
sample from which the regression equation was calculated. The
c values for equation (77) are obtained by the simultaneous solution
of the following equations:


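$$\begin{aligned}
(\Sigma x_2^2)c_{22} + (\Sigma x_2x_3)c_{23} + (\Sigma x_2x_4)c_{24} &= 1 \\
(\Sigma x_2x_3)c_{22} + (\Sigma x_3^2)c_{23} + (\Sigma x_3x_4)c_{24} &= 0 \\
(\Sigma x_2x_4)c_{22} + (\Sigma x_3x_4)c_{23} + (\Sigma x_4^2)c_{24} &= 0
\end{aligned} \qquad (78)$$

$$\begin{aligned}
(\Sigma x_2^2)c_{32} + (\Sigma x_2x_3)c_{33} + (\Sigma x_2x_4)c_{34} &= 0 \\
(\Sigma x_2x_3)c_{32} + (\Sigma x_3^2)c_{33} + (\Sigma x_3x_4)c_{34} &= 1 \\
(\Sigma x_2x_4)c_{32} + (\Sigma x_3x_4)c_{33} + (\Sigma x_4^2)c_{34} &= 0
\end{aligned} \qquad (79)$$

$$\begin{aligned}
(\Sigma x_2^2)c_{42} + (\Sigma x_2x_3)c_{43} + (\Sigma x_2x_4)c_{44} &= 0 \\
(\Sigma x_2x_3)c_{42} + (\Sigma x_3^2)c_{43} + (\Sigma x_3x_4)c_{44} &= 0 \\
(\Sigma x_2x_4)c_{42} + (\Sigma x_3x_4)c_{43} + (\Sigma x_4^2)c_{44} &= 1
\end{aligned} \qquad (80)$$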





If these equations are solved, it will be found that $c_{23} = c_{32}$,
$c_{24} = c_{42}$, and $c_{34} = c_{43}$. Also, it is evident that the coefficients of
the equations to the left hand of the equality signs are identical in all
three sets of equations and are also identical with those of the normal
equations (38), used in determining the net regression coefficients.
These two facts make it possible to compute the values of all the
c's at the same time that the b's are computed, with only a relatively
slight additional amount of work. This process is given in detail in
the appendix, "Methods of Computation," pages 469 to 474.

The c's for a large number of independent variables are obtained
by an expansion of equations (78) to (80), setting up as many sets of
simultaneous equations as there are independent variables and placing
the 1 on the right-hand side of the equations opposite the variable
whose $(\Sigma x_i^2)$ occurs with the $c_{ii}$, just as for the second set of
equations (79) above, where 1 occurs to the right of the equation in which
$(\Sigma x_3^2)c_{33}$ occurs as one of the items on the left of the equality sign.

The standard error of the individual forecast, according to equation
(77), will differ for each combination of values of the various independent
variables shown in the new observation. If the values of
these several independent variables all fall at about their mean values,
$\sigma_{x'_{1.234}-x_1}$ will be only slightly larger than $\bar{S}_{1.234}$. If they fall far from it,
or even if one independent variable falls far from it, where the standard
error of the net regression for that variable is very large, the standard
error of the estimate will be correspondingly large.
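The computation can be sketched as follows. The three sets of simultaneous equations share one matrix of sums of squares and cross-products of departures, and their right-hand sides are the columns of the identity, so the c values together form the inverse of that matrix. The sketch assumes the bracketed expression of the forecast formula is 1 + 1/n plus the quadratic form in the departures; the function names and the numerical moments are invented for illustration and do not come from any example in the text.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("moment matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def forecast_variance(moments, departures, s_squared, n_obs):
    """Squared standard error of one forecast from a multiple regression.

    moments    -- sums of squares and cross-products of departures from the means
    departures -- the new observation, stated as departures from the sample means
    s_squared  -- squared (adjusted) standard error of estimate
    n_obs      -- number of observations in the original sample
    """
    c = invert(moments)  # the c values, one row per set of equations
    k = len(departures)
    quadratic = sum(c[i][j] * departures[i] * departures[j]
                    for i in range(k) for j in range(k))
    return s_squared * (1.0 + 1.0 / n_obs + quadratic)


# Hypothetical moments for three independent variables (not from the text).
MOMENTS = [[40.0, 12.0, 6.0],
           [12.0, 30.0, 9.0],
           [6.0, 9.0, 50.0]]
at_means = forecast_variance(MOMENTS, [0.0, 0.0, 0.0], s_squared=25.0, n_obs=20)
far_out = forecast_variance(MOMENTS, [3.0, -2.0, 4.0], s_squared=25.0, n_obs=20)
print(at_means, far_out)
```

With all departures at zero the quadratic term vanishes and only the 1 + 1/n factor remains, which is why the forecast error at the means is only slightly larger than the standard error of estimate.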

For n variables, the general formula for the square of the standard
error of the individual estimate is given symbolically by

$$\sigma^2_{x'_{1.23 \ldots n}-x_1} = \bar{S}^2_{1.23 \ldots n}\left[1 + \frac{1}{n} + (c_2x_2 + c_3x_3 + \ldots + c_nx_n)^2\right] \qquad (81)$$

In expanding equation (81) for any number of variables, it must be
interpreted by the special condition that $c_2c_2 = c_{22}$, $c_2c_3 = c_{23}$, etc.

The standard errors of individual estimates made from multiple re-
gression equations, according to equations (77) or (81), can be inter-
preted in exactly the same way as given above for standard errors of
individual estimates from simple regression equations, from equation
(75).

Curvilinear correlation. Where a simple or multiple curvilinear
relation is determined by fitting mathematical regression equations,
the standard error of individual estimates can be computed by an
extension of equation (77). Thus if a cubic parabola has been fitted
using

$$Y = a + b_2X + b_3X^2 + b_4X^3$$

we can compute this equation most readily by writing it in the form

$$Y = a + b_2X + b_3U + b_4V$$

where $U = X^2$ and $V = X^3$.

The standard error of an individual estimate is then given by the
equation

$$\sigma^2_{y'-y} = \bar{S}^2_{y.f(x)}\left[1 + \frac{1}{n} + c_{xx}x^2 + c_{uu}u^2 + c_{vv}v^2 + 2c_{xu}xu + 2c_{xv}xv + 2c_{uv}uv\right]$$


Similar expansions are available for mathematical regression equations 
for two or more variables.²
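When the cubic is handled in this way, X, U, and V are treated as three separate independent variables, so the new observation must be stated as three departures, each from the sample mean of its own variable. A minimal sketch of that step follows; the function name and the sample values are assumptions made for illustration only.

```python
def cubic_departures(sample_x, new_x):
    """Departures x, u, v of a new observation for a fitted cubic.

    U = X**2 and V = X**3 are treated as separate variables, so each departure
    is taken from the sample mean of that variable, not from a power of the
    mean of X.
    """
    n = len(sample_x)
    mean_x = sum(sample_x) / n
    mean_u = sum(value ** 2 for value in sample_x) / n
    mean_v = sum(value ** 3 for value in sample_x) / n
    return new_x - mean_x, new_x ** 2 - mean_u, new_x ** 3 - mean_v


# Hypothetical sample of X values; the new observation sits at the mean of X.
x, u, v = cubic_departures([1.0, 2.0, 3.0], 2.0)
print(x, u, v)
```

Even when the new X falls exactly at its mean, the departures u and v are in general not zero, so the terms in u and v still contribute to the standard error of the forecast.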

Where the regression curve has been determined graphically, the 
standard error for simple correlation is as follows: 

$$\sigma^2_{y'-y} = \bar{S}^2_{y.f(x)} + [\text{standard error of } f(X) - f(M_x)]^2 \qquad (82)$$

In equation (82), the last term of the equation is that determined by 
equation (74.1), for the particular value of X for which Y is to be 
estimated. 

In the case of multiple curvilinear correlation, with the regressions 
graphically determined, no precise equation has yet been developed 
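to give the standard error of individual estimates. A roughly approximate
value may, however, be calculated as follows:

$$\sigma^2_{x'_1-x_1} = \bar{S}^2_{1.f(2,3,4)} + [\text{standard error of } f_2(X_2)]^2 + [\text{standard error of } f_3(X_3)]^2 + [\text{standard error of } f_4(X_4)]^2 \qquad (83)$$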

where the second, third, and fourth terms are the standard errors of
the curvilinear regressions for the particular values of $X_2$, $X_3$, and $X_4$,
which are represented in the estimate of $X_1$, calculated according to
equations (74.2) or (74.21).

² Henry Schultz, The standard error of a forecast from a curve, Journal of the
American Statistical Association, pp. 139-185, June, 1930.





Equation (83) gives only a rough approximation to the true standard
error because it excludes the terms which provide for the cross-
products between the different independent variables. Where the
intercorrelations ($R_{2.34}$, etc.) between the independent variables are
low (say, 0.50 or lower), this will probably not affect the calculated
error very much. Where the intercorrelations are quite high, this
estimated value may overestimate or underestimate the true error by
a considerable margin.

The Applicability of a Regression Equation to an Extrapolation 
beyond the Observed Range 

We have already seen examples, in Chapters 14 and 16, of how 
estimates might sometimes need to be made for new observations which 
lie beyond the range included in the original sample. We have also 
seen the possibility of exceptionally large errors of estimate when the 
formulas or curves are extrapolated in this way beyond the observed 
range. A rough rule-of-thumb has been given that estimates beyond 
the observed range should never be made, or, if they must be made, 
should be regarded as exceptionally hazardous. This present section 
will explore further the meaning of the statement "beyond the range
of observation."

Where only two variables are concerned, there is no question as to 
the range covered in the original observations. Thus if we consider 
the data plotted in Figure 23, on page 154, it is apparent at once that 
the independent variable, X, covers the range from 1.2 to 3.5. Any 
new values of X smaller or larger than those values would be beyond 
the observed range. 

Where two or more independent variables are concerned, the situa- 
tion is more complex. Thus the data of the example plotted on 
page 170, in Figures 25 and 26, show that the acres range from 60 to 
240, and the cows range from 0 to 18. Suppose a new observation were 
drawn from the same universe, with 225 acres and 17 cows. Would 
that observation be within the original range? At first it might seem 
that it would, since the number of acres falls within the original 
acreage range, and the number of cows within the original range for 
cows.

Multiple correlation, however, is concerned not merely with the 
relation of the dependent variable to each independent variable sepa- 
rately, but with the composite relation to all the independent variables 
together. Is the combination of 17 cows and 225 acres, whose effect
was represented, either exactly or approximately, within the original
observations? This combination involves the joint values for $X_2$ and
$X_3$, which were represented in the original observations. These are
shown plotted on Fig. 27, page 170. It is evident from this figure 
that the new combination lies well outside the observed joint distribu- 
tion of cows and acres. 

The original sample had some farms of between 200 and 250 acres, 
but none of them had more than 6 cows. It also had some farms of 
15 or more cows, but none of them had more than 120 acres. The 
single original case that came anywhere near the new observation was 
a farm with 14 cows and 180 acres. Even this one case is quite dif- 
ferent from the new observation with 17 cows and 225 acres. Since 
the new observation lies well outside the joint distribution or combina- 
tion of values represented in the original sample, any estimate made 
for it from a regression equation based on that sample is subject to 
an extra degree of hazard, beyond that given by the error formulas 
discussed in the preceding portion of this chapter. Those formulas 
give accurate values of the probable error of individual estimates only 
within the range represented by the original sample. Extrapolation 
of the regression equation or curves beyond that range, or combination 
of values, represents an extension into unknown fields, where sudden 
changes in the nature of the relations might conceivably occur. 
A priori knowledge of the relations, based on technical facts and the- 
ories, or on other evidence, may justify extrapolations of the curves. 
Any assumption that the errors of such extrapolations can be calculated 
from the error formula derived from the sample depends on a con- 
tinuation of the observed relations into the unsampled range of values. 
Such an assumption can be justified only by other information, inde-
pendent of the observed values and the constants calculated from them. 
Estimates of error for such extrapolations are only as reliable as the 
assumptions on which the extrapolations are based. 

Where there are three or more independent variables, it is still more 
difficult to determine whether a given new combination of values lies 
outside the joint distribution of the three or more variables in the 
original sample. In many cases this can be determined by careful 
checking of the new observation against such dot charts as those in 
Figure 42, on page 270 of Chapter 16. Thus, suppose a new observa- 
tion were drawn with 2 cows, 100 acres, and 4 men. Would this be 
within the range of the original observations? 

Careful inspection of the charts on page 270, and of the data on 
page 199, reveals that, although the combination of 2 cows and 100 acres
is well within the observed joint distribution for those two variables, 
no such combination occurred with 4 men, or even with 3 men. The 
nearest values are one observation (No. 7) of 3 men with 6 cows and 
170 acres and one other (No. 12) of 3 men with 15 cows and 120 acres. 
The new observation, of 4 men with 2 cows and 100 acres, would ap- 
parently involve much more human labor, to care for that many cows 
and acres, than was represented in the original observations, and 
therefore lies far outside the joint distribution represented in the 
sample for the three values. It is quite possible that that much 
labor would represent a wasteful use, so that the additional men would 
be more likely to reduce the farm income rather than increase it. An 
estimate of income for this new farm, based on the relations shown 
in the sample for quite different farms, might therefore be very sadly 
in error. 

The rough process of comparing the new observation with the values 
of the independent variable for the original observations, as illustrated 
above, may serve reasonably well for determining whether the new 
observation is or is not represented in the original sample. Methods 
are available for computing the exact probability of the new observa- 
tion being drawn from the distribution represented in the original 
observations.³ Carrying through such calculations ordinarily would
seem to involve an amount of labor out of proportion to the value of 
the information obtained. For very exact work, or for estimates of 
very great importance, however, it might be worth working them out. 
This would be true especially where the new observation happened to 
fall at about the edge of the distribution zone of the previous observa- 
tions, so that it was uncertain whether or not it would be safe to 
estimate the dependent variable from the relations previously observed. 
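The rough comparison described above can be mechanized in a simple way. The sketch below is not the exact probability calculation mentioned in the text; it is a crude screening device, assumed here for illustration, that first checks each variable against its own observed range and then finds the nearest original observation, with each variable scaled by its range. The acre and cow figures in the sample are invented, chosen only to follow the pattern the text describes (large farms with few cows, many-cow farms with few acres, and one farm of 180 acres and 14 cows).

```python
import math

# Invented (acres, cows) pairs that mimic the pattern described in the text.
SAMPLE = [(60, 3), (80, 18), (90, 0), (100, 15), (120, 16),
          (150, 8), (180, 14), (200, 5), (220, 6), (240, 4)]


def within_marginal_ranges(sample, new):
    """True if every value of `new` lies inside that variable's own observed range."""
    return all(min(column) <= value <= max(column)
               for column, value in zip(zip(*sample), new))


def nearest_observation(sample, new):
    """Closest original observation, each variable being scaled by its observed range."""
    spans = [max(column) - min(column) for column in zip(*sample)]

    def distance(point):
        return math.sqrt(sum(((p - q) / s) ** 2 for p, q, s in zip(point, new, spans)))

    best = min(sample, key=distance)
    return best, distance(best)


new_farm = (225, 17)  # 225 acres and 17 cows, as in the text
print(within_marginal_ranges(SAMPLE, new_farm))
print(nearest_observation(SAMPLE, new_farm))
```

A new observation can pass the first check and still lie far from every original observation, which is the situation of the farm with 225 acres and 17 cows. How large a scaled distance should be regarded as hazardous is left to judgment; the sketch sets no threshold.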


The Use of Error Formulas with Time Series 

Many of the problems that are important in economics and other 
social sciences involve measurements in time. Even in crop forecast- 
ing and in some other problems involving biological reactions, time 
series must be analyzed. 

All the error formulas presented in this chapter and in the preced- 
ing chapter, as well as those in Chapter 2, are based upon the theory 

³ The article by Waugh and Been, cited in full at the end of this chapter, gives
formulas for calculating this probability. This article also considers the standard
error of the individual estimate and gives error formulas similar to those presented
earlier in this chapter.





of simple sampling. That theory assumes that each observation in a 
sample is selected purely at random from all the items in the original 
universe. It also assumes that successive samples are selected in such 
a way that values found in one sample have no relation or connection
with the values found in the next sample.⁴ If the successive months or
years in a time series are regarded as successive observations, the first 
assumption obviously may not hold true. Each successive item of a 
linear trend line is perfectly correlated with each preceding item. Each 
price of a given commodity on succeeding days or months may show 
some relationship to prices in the preceding period. If the correlation 
between each item of a series and each item of the same series fol- 
lowing it in time is calculated by the usual methods, the resulting 
correlation coefficient is termed the coefficient of serial correlation. In
time series, almost every variable will show serial correlations that 
differ significantly from zero. That fact has been urged as a reason 
why the theory of errors cannot be used at all with such data. It 
also has been urged as a reason for not even using ordinary correlation 
techniques with time series, unless special devices, such as successive 
first differences, are used to eliminate the serial correlations.⁵
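The coefficient of serial correlation described here, and the device of successive first differences, can both be sketched briefly. The sketch assumes the ordinary product-moment correlation applied to each item paired with the item following it; the function names and the trend series are illustrative assumptions.

```python
import math


def correlation(xs, ys):
    """Ordinary product-moment correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def serial_correlation(series, lag=1):
    """Correlation between each item of a series and the item `lag` periods later."""
    return correlation(series[:-lag], series[lag:])


def first_differences(series):
    """Change from each item of a series to the next."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# A hypothetical linear trend: each item is perfectly correlated with the preceding one.
trend = [2.0 * t + 5.0 for t in range(10)]
print(serial_correlation(trend))
print(first_differences(trend))
```

The first differences of a straight-line trend are all equal, so they have no variation left to correlate; for such a series the serial correlation of the differences is undefined rather than zero, and the sketch does not attempt to compute it.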

Time series also differ from the situation assumed in simple sam- 
pling in their lack of constancy of the universe. The formulas of simple 
sampling assume that there is a large or infinite universe of similar 
events, from which the sample is drawn at random. Such a universe 
might be, for example, the number of dots turned up at each throw by 
throwing a pair of dice a large number of times. They also assume 
that new observations or new samples will be obtained by drawing 
in exactly the same way from exactly the same universe, as by making 
additional throws with the same set of dice under exactly the same 
conditions. Precise probability forecasts can be made from the original 
sample concerning the proportions of new samples or observations that 
will show certain characteristics under these ideal and highly simplified 
conditions. 

When any phenomenon is sampled at successive intervals of time 
the "universe" being studied can never be precisely the same. Even


⁴ Note the way this assumption comes into the derivation of the equation for
the standard error of the arithmetic average, in Note 1, Appendix 2. See also
Richard von Mises, Probability, Statistics and Truth, The Macmillan Co., New
York, 1939.

⁵ See Alexander Sturges, Price analysis as a guide in marketing control: the use
of correlation in price analysis, Journal of Farm Economics, Vol. XIX, pp. 699-706,
August, 1937.





successive astronomical observations differ, even if in imperceptible 
degrees, because of the loss of matter radiated from the various stars. 
Surveying measurements in successive years may differ because of 
slight geological shifting of the earth’s surface, or because of erosion 
or other changes in the soil surface. Normal crop or livestock yields, 
as seen earlier, may change because of improvements in the biological 
make-up of the seed or in the strains of stock so that what would be 
normal yields for certain weather or feed at one time become subnormal
yields at another. The "population" of corn plants or of cows
is not static — it changes constantly as one generation passes away and 
new ones come upon the scene. Human populations, too, change con- 
stantly by birth, growth, and death, so that what is the normal average 
height or weight in one year is different in another. The habits and 
ideas of living men may change much faster than the people them- 
selves change. Ordinarily, perhaps, those habits and ideas change 
slowly — most people will react to the idea of “socialism,” for example, 
one day in much the same way that they reacted to the same idea a 
week or a month earlier. But sometimes, under the force of social 
pressures, world-sweeping events, or economic or other catastrophes, 
ideas change swiftly and dramatically. Many Iowa farmers who, in 
1928, were born-and-bred Republicans of the most conservative type, 
were threatening to hang judges to prevent foreclosure sales four 
years later, in 1932. 

To the extent that changes in the universe follow a steady rate of 
progression, they can sometimes be allowed for by trend factors (as 
shown earlier) or by progressive shifts in the regressions themselves. 
So long as the composition of the universe — in correlation analyses, 
the character of the relations or reactions which are under study in a 
given set of circumstances — remains substantially constant from period 
to period, with at most only well-defined patterns of change, the 
changing character of the universe can thus be allowed for, at least in 
part. In such cases, forecasts of future changes depend upon a con- 
tinuation of the same rate or degree of change. It is never possible 
to be certain, however, that a new event may not make a sudden 
change or break in the trend — as the declaration of war in September, 
1939, produced a sudden change in prices and markets. 

But what of the independence of successive observations? Does 
the fact of serial correlations mean that correlation cannot be applied 
successfully, and that regression relations found in past cases will 
never work out equally well in practice? 

We have seen already in several cases (notably in Chapters 14
and 16) that forecasts worked out by extrapolating an earlier formula 
to subsequent years have given results which agreed remarkably well 
with the standard error of estimate. Had we calculated the wider 
standard error of individual forecasts (equation [77] or [82]) for 
these cases, the agreement of the actual errors with the expected range 
of error would have been even better. This agreement is contrary to 
what we would have expected, on the basis of the theories set forth 
above. Is it merely a lucky accident, or does it indicate that the 
sampling equations have a wider applicability than their basic assump- 
tions would lead us to expect? 

This problem is one of the greatest unsolved questions in the whole 
field of modern statistical methods. It is one where the widest pos- 
sible range of judgments may be found among the experts who should 
be able to agree upon the answer. Without presuming to give a final 
answer to the question, we may advance certain considerations to ex- 
plain why correlation results with time series may be more reliable 
than some critics have believed they could be: 

In the discussion to this point, we have assumed that after we knew
the values for, say, 1940, the values for 1941 constituted the next
observation. Also, we have tacitly assumed that since 1941 will have
only a single set of values (say of rainfall, temperature, and corn
yields), we are not selecting a new observation at random, but are
drawing a unique and predetermined set of values.

Let us examine this a little closer. So far as the trend value is 
concerned, that is so. The trend reading for 1941 (as for variable $X_4$
in the problem of Chapter 14) is bound to be one unit larger than for 
1940, and the estimated contribution of trend to the yield is therefore 
expected to be exactly in line with that of preceding years. As ex- 
plained above, this is inherent in our assumption of a gradual yet con- 
tinuous shift in the composition of the universe. But what of the 
values of the other variables? Are they predetermined? 

Rainfall in a given season, so far as meteorologists have been able 
to explain it, is the final result of a large number of accidental circum- 
stances, mainly unpredictable very far in advance. Efforts to forecast 
the rainfall from that of the preceding season or seasons have yielded
unsatisfactory results. There seems to be something of an irregular 
periodicity in weather over considerable periods of time, but the un- 
predictable year-to-year fluctuations around that irregular trend are of 
much greater magnitude than the trend itself. Much the same is true 
for temperature. So far as the rainfall and temperature are concerned, 
then, the values encountered in a given year may be regarded as pretty
completely random "drawings" out of nature's grab-bag of all the possible
weather combinations that might occur that year. The yield
of corn, in turn, represents a similar "drawing" out of all the possible
yields that might accompany that weather combination, for a
year when the gradually shifting elements in the universe were of the 
magnitude as measured (more or less accurately on its extrapolation 
into the new territory) by the trend. Explained in this way, all the 
observations may be regarded as reasonably “random” sampling from 
the observations that otherwise might have been secured for the given 
year. And if the drawings for that year, as well as the drawings for 
each of the other years, have no particular relation with what other 
observations might have been drawn each year if the forces of nature 
had nodded another way instead, we may feel reassured that the 
successive observations were really random — always excepting the 
factor of progressive change measured, more or less accurately, by the 
trend element. (It must be remembered, also, that this is not a trend 
in yield or a trend in rainfall but rather a trend for the yield secured 
under constant conditions of rainfall and temperature. The trend 
is itself a net regression, measured while eliminating the variation in 
yield associated with changes in the other variables.) For the meteoro- 
logical problem, then, we may feel that we can calculate standard 
errors with reasonable propriety, even though all the data are time 
series. Also, we see that we can construct a reasonably satisfactory
explanation for why the data behave as if the theory of sampling did
apply, in the sense of the forecast showing about the expected range
of error.

But what of time series where the data are economic, and not 
meteorological? How about the steel-cost problem of Chapter 16?
In that problem, steel costs, wage rates, and percentage of capacity 
operated are all parts of a progressing and evolving economic system. 
It was obvious from the data that wage rates changed progressively 
and in most cases slowly, with relatively little change from one year 
to another. Also, it was evident (as revealed eventually in the trend 
factor) that costs changed gradually and progressively with relation 
to such factors as price level which also changed with time and which 
were not otherwise explicitly recognized in the analysis. 

Once the trend allowance has been made to take care of the chang- 
ing nature of the universe (including, in this case, rough and imperfect 
allowance both for technology changes and price-level changes), the 
ordering of the data in time has no influence on the final correlation 
result. If the costs adjusted for the (net) trend were taken as the
dependent factor, the observations could be jumbled in any sequence 
selected and the net regressions on wages and percentage capacity 
would still be the same. Although wages cannot be regarded as random 
events for any year, the cost rate can be regarded as a random selec- 
tion from all the possible cost rates that might accompany that par- 
ticular wage rate and operation rate. It is precisely by revealing the 
distribution of cost rates for given values of the other factors that 
the net regressions are determined. So it seems that from this point 
of view we can say that the cost (adjusted for net trend) which ac- 
companies a given combination of values of each of the other factors 
is a random sample from all the costs that might accompany that 
combination — and that the error formulas may therefore be used. 
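The statement that the observations could be jumbled in any sequence without changing the regressions is easily checked. The sketch below uses a simple two-variable least-squares slope in place of the net regressions of the steel-cost problem, and the wage and cost figures are invented for illustration. It shows only that the computed constant does not depend on the order of the rows; it does not by itself establish that the error formulas apply.

```python
import random


def slope(xs, ys):
    """Least-squares regression coefficient of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Invented (wage rate, trend-adjusted cost) pairs, listed in order of time.
pairs = [(50, 20.1), (52, 20.9), (55, 21.8), (56, 22.6), (60, 23.9), (63, 25.2)]
in_time_order = slope(*zip(*pairs))

jumbled = pairs[:]
random.Random(0).shuffle(jumbled)  # fixed seed so the jumbling is repeatable
after_jumbling = slope(*zip(*jumbled))

print(abs(in_time_order - after_jumbling) < 1e-9)
```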

There are two limitations, however, on the conclusions concerning 
the possible independence of events, even in economic time series. 
(1) In the steel-cost problem, there was some indication of a lag in 
effect. For example, an increase in wage rate one year (because of 
accounting difficulties or even because of length of process in the 
production of finished products) might not show up fully in per-unit 
costs until the next year, and vice versa. To the extent that values 
of the dependent factor one year reflect not current year values but 
previous year values of the independent factors, that might introduce a 
systematic error in the regressions. This error would be particularly 
serious if the serial correlation was high for the independent factor 
involved, such as the wage rate. Only if such a lagging effect was 
specifically allowed for and measured by introducing the wage rate 
of the preceding year as an additional independent variable affecting 
the cost of the given year, could such erroneous results in the regres- 
sions be definitely eliminated. 
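The allowance for a lag described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the computation used for the steel-cost problem: the function name `fit_with_lag` and the series are invented, and the invented costs are built so that half the wage effect arrives a year late.

```python
import numpy as np

def fit_with_lag(cost, wage, capacity):
    """Least-squares fit of cost on current wage, preceding-year wage, and capacity.

    The first observation is dropped because it has no preceding wage value.
    Returns coefficients in the order (constant, wage_t, wage_{t-1}, capacity_t).
    """
    cost = np.asarray(cost, dtype=float)
    wage = np.asarray(wage, dtype=float)
    capacity = np.asarray(capacity, dtype=float)
    design = np.column_stack([
        np.ones(len(cost) - 1),
        wage[1:],      # wage rate of the given year
        wage[:-1],     # wage rate of the preceding year (the lagged variable)
        capacity[1:],
    ])
    coef, *_ = np.linalg.lstsq(design, cost[1:], rcond=None)
    return coef

# Invented series: cost responds half to this year's wage, half to last year's.
rng = np.random.default_rng(0)
wage = 50 + np.cumsum(rng.normal(1.0, 0.5, size=30))
capacity = rng.uniform(40, 95, size=30)
cost = np.full(30, np.nan)          # first year has no lagged wage, so it stays unused
cost[1:] = 10 + 0.5 * wage[1:] + 0.5 * wage[:-1] - 0.1 * capacity[1:]

coef = fit_with_lag(cost, wage, capacity)
```

Because the invented costs contain no random element, the fit returns the constants used to build them; with real data the two wage coefficients would each carry a sampling error, and that error tends to be large when the wage series is highly serially correlated.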

(2) In the previous examples we have been using illustrations (crop
years or production periods) where there is a natural break between
the successive intervals. Suppose we were studying corn prices,
though, and took weekly prices, weekly national corn supply available, 
and weekly values of other factors. Over four years we should have 
208 separate sets of values. Should we be correct in saying we had 
208 independent observations of the effect of these several variables on 
corn prices, in the same way that we could say that observations of 
corn yields for 25 years gave us 25 independent observations of the 
factors influencing yield? Obviously, we should not be correct. In- 
stead of having observations of 208 different phenomena, we have a 
number of repeated observations of what are essentially the same phe- 
nomena. To calculate sampling errors at all, or even to judge what n 





to use in applying the various corrections to our constants, we must 
take our successive observations of time-series phenomena at suffi- 
ciently long intervals so that we sample essentially different phenomena 
at each successive observation. With crop years or other natural breaks 
in the process, it is easy to pick such appropriate breaks. Where the 
process is a continuous one, such as the production of steel, it is more 
difficult to select appropriate intervals for making the "cuts" to
provide independent cross-sections of the continually changing phenomena.
If the variables represent the total of output or production over given
periods, these successive "cuts" must be uniformly spaced in time.
If the variables are rates in time, however, that would not be essential, 
and the observations might be spaced at varying intervals so as to 
catch particular values of independent variables. Thus in the steel- 
cost problem of Chapter 16, we might have selected the observations 
from the months in which operation rate (percentage of capacity) made 
new highs or lows, and used those as the basis for our analysis. (For 
the reasons set forth in Chapter 20, such selection could be applied only 
to independent factors. If applied to the dependent one, it would 
seriously bias the results.) The guiding principle must be to select 
the observations in such a way as to make each successive observation 
a new observation of a new set of the variable phenomena, except for 
such continuously progressive elements as can be appropriately
eliminated by the simultaneous fitting of a trend or trends.⁶

The preceding discussion suggests some ways in which economic 
time series can be examined to see if conditions are present which would 
prevent the theory of sampling from applying, or can be selected so 
as to make the use of the theory reasonable. If they are found or can 
be made reasonably free from such conditions, forecasts based on such 
analyses might be expected to follow reasonably well the error limits
given by the formulas based on sampling theory. That may explain 
why forecasts from economic time series, such as those shown for the 
steel-cost problem of Chapter 16, may in fact agree quite well with 
what would be expected if the error formulas did apply. 
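One simple way of examining a series for the condition just described is to compute the correlation between each observation and the one before it, first for the series as recorded and then for a thinned set of uniformly spaced cuts. The sketch below is an illustration under stated assumptions only: the function names are hypothetical, the weekly series is invented, and the lag-one serial correlation is offered as one rough diagnostic rather than as a test prescribed in this chapter.

```python
import numpy as np

def lag1_serial_correlation(series):
    """Correlation between each value and the value one period earlier."""
    s = np.asarray(series, dtype=float)
    return float(np.corrcoef(s[:-1], s[1:])[0, 1])

def thin(series, step):
    """Keep every `step`-th observation (a uniformly spaced set of cuts)."""
    return np.asarray(series, dtype=float)[::step]

# Invented weekly series with strong week-to-week persistence (208 weeks).
rng = np.random.default_rng(1)
weekly = np.empty(208)
weekly[0] = 0.0
for t in range(1, 208):
    weekly[t] = 0.9 * weekly[t - 1] + rng.normal()

r_weekly = lag1_serial_correlation(weekly)
r_thinned = lag1_serial_correlation(thin(weekly, 13))  # roughly quarterly cuts
```

A serial correlation near 1 in the recorded series suggests that the 208 weekly values are far from 208 independent observations. The thinned series has only 16 values, so its own serial correlation is subject to a wide sampling error and should be read accordingly.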

Where there is clear indication of lagging effects from period to 
period which cannot be specifically allowed for, or where the serial 
correlations in the data are so high as to make the several observations 

⁶ A rough method for judging the length of intervals necessary to obtain such
independent values is given by L. R. Hafstad, On the Bartels technique for time-
series analysis and its relation to the analysis of variance, Journal of the American
Statistical Association, Vol. 35, July, 1940. Pages 347 to 353 are especially germane
to the discussion here.





not really independent observations at all, then the sampling formulas 
simply do not apply, because the assumptions on which they are based 
are not fulfilled in the given problem. In such cases the error formulas 
may still be calculated, in the hope that they will indicate the minimum 
possible reliability of the results instead of the maximum possible 
unreliability. But whether they will do even that correctly is not 
yet known. 

In closing this discussion of the time-series problem, one word may 
be added on practical procedure. That is on the possibility of testing 
the actual forecasting efficiency of an analysis by saving the last two 
or three observations as test values on which to try out the adequacy 
of the regression equation derived from the earlier observations. This 
technique, illustrated in several of the examples analyzed in previous 
chapters, has been found a useful precaution in practical research work. 
But sometimes the test values will indicate a sudden change in the level 
of the trend, or will suggest a change in one of the other regressions. 
In such cases it will usually be better to recalculate the data, using 
all the information down to the latest year, rather than to base the
real forecast for the next year to come on an extrapolation already 
found to be in doubt. For, as illustrated in several of the problems 
and as demonstrated in the first section of this chapter, the longer the 
extent of extrapolation beyond the base data, the greater the possibility 
of error. And that applies just as definitely when a net trend regression 
is being extrapolated as any other regression curve. So save a few final 
values for test cases — but do not hesitate to add them to the sample 
analyzed, if that is found necessary to account satisfactorily for the 
most recent relations. 
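The precaution of saving the last few observations as test values may be sketched as below. The example assumes a simple linear regression on one independent factor; the function name `holdout_check` and the twelve observations are invented for the illustration.

```python
import numpy as np

def holdout_check(x, y, n_test=3):
    """Fit y = a + b*x on all but the last n_test points; return the fit and the test errors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b, a = np.polyfit(x[:-n_test], y[:-n_test], deg=1)   # slope first, then intercept
    fitted = a + b * x[:-n_test]
    # Standard error of estimate, allowing for the two constants fitted.
    s_est = float(np.sqrt(np.sum((y[:-n_test] - fitted) ** 2) / (len(fitted) - 2)))
    test_errors = y[-n_test:] - (a + b * x[-n_test:])
    return float(a), float(b), s_est, test_errors

# Invented observations: a straight line with a small alternating disturbance.
x = np.arange(12.0)
y = 2.0 + 0.5 * x + 0.1 * (-1.0) ** np.arange(12)

a, b, s_est, test_errors = holdout_check(x, y, n_test=3)
```

If the test errors are of about the same size as the standard error of estimate, the equation has passed the trial. If they are much larger, or all of one sign, that points to the change in level or in slope discussed above, and the test values are better added to the sample and the constants recomputed.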


Practical Procedures in Judging the Reliability of Forecasts 

Up to this point this chapter has presented mathematical procedures 
for judging the confidence that can be placed in an individual forecast, 
solely in the light of the information given by the individual sample 
from which the estimating (regression) equation was derived. In 
actual practice, the statistician has usually only this single set of 
sample data to judge from. Usually (and almost universally in time 
series) he cannot draw a new sample and contrast the results of the 
second sample with the first. He may, however, have other prior 
information to help guide him in making and interpreting his forecast. 
Or he may be able to draw a series of samples, each one throwing light 
on a different aspect of the same problem. Each one of those samples 





may give results subject to a wide margin of random error. Yet if the 
results of the several different approaches are all consistent with one 
another, the whole set together will provide a more dependable basis 
for a forecast than is indicated by the calculated standard errors for 
any one taken separately. Other relevant information may be of a 
quite non-quantitative nature, yet it may serve to help guide the 
analyst in making his forecast. 

Any forecast is hazardous, for the future can never be perfectly 
known. Yet life always consists in making plans for the future. Every 
business man, every farmer, every consumer is constantly making 
judgments as to the future, and making commitments or taking action 
based upon those judgments. The success or failure of the actions in 
reaching the ends sought often depends in large measure upon the 
accuracy of those estimates. In every-day life such estimates are 
usually based on hunches, waves of opinion, the most recent happen- 
ings, rule-of-thumb analyses, or even blind guess-work. If the statis- 
tician is to serve society, it must be on this action front, where his 
analysis of past relations will help provide a surer guide into the 
unseen over the horizon. Many statisticians hesitate to make fore- 
casts, for they know how little statistical dependability they can place 
in them. They fear to risk their reputations upon forecasting an 
uncertain event. They need not be so hesitant. To be useful, all 
that is required is for their forecasts to be more reliable, on the average, 
than the forecasts on which such judgments have been based in the 
past. Many business forecasting services have been accurate only to
55 per cent, yet have kept in business on the gain over the 50-50 odds
of the completely uninformed guess. In events characterized by waves
of emotions or by common response of many individuals to the same
stimuli, as in the business cycle or the hog cycle, the accuracy of the
uninformed guesses (and actions) of producers may have averaged
only 30 to 40 per cent right. Even a forecast which is very fallible
when judged by its mathematical or statistical significance may yet 
yield a greatly improved guide to human actions. If the statistician 
will base his forecast on all the information at his command, quantita- 
tive and non-quantitative, and will guard his forecast by some state- 
ment as to its range of dependability, he can both aid judgments and 
protect his reputation. In the end, he must be willing for that
reputation to rest upon the average accuracy of a long series of estimates
rather than upon the lucky calling of any one individual event. And
the more the technical operations on the statistical side can be
reinforced by the knowledge, theories, experience, and judgments of the





researcher as practical agronomist, sociologist, economist, meteorologist, 
or other technician, the more valuable the statistical operations will
become as a basis for an informed and useful projection from the events
of the past into the still-malleable future.⁷

Summary. In this chapter we have discussed the problem of the 
accuracy of individual estimates of the dependent variable for new 
observations drawn from the same universe as the original sample and 
have presented methods of estimating the probable range of error for 
such estimates. We have considered the question of whether the new 
observations of independent factors represent portions of the distribu- 
tion of the same factors present in the sample or whether they include 
new and therefore untested values or combinations of the same varia- 
bles, with resulting unpredictable effects upon the estimates of the 
dependent variable. We have also discussed the question of the
applicability of error theory to time series, shown how in many cases
it may still be applied, and indicated some rough tests to judge whether
the time series is of a character that will make error
calculations completely inapplicable. Finally, we have given some
hints for the handling of time-series correlations in such a way as to 
minimize the errors when extrapolating the regressions for purposes 
of practical forecasting, and have made some suggestions for combining 
tests of significance with other prior information as a basis for making 
and judging forecasts. 

REFERENCES 

Waugh, Frederick V., and Been, Richard O., Some observations about the validity
of multiple regressions, Statistical Journal of the College of the City of New
York, Vol. 1, No. 1, pp. 6-14, January, 1939.

Schultz, Henry, The standard error of a forecast from a curve, Journal of the
American Statistical Association, pp. 139-185, June, 1930.

⁷ Compare this discussion with the concluding section of Chapter 2, pages
30 to 32.



CHAPTER 20 


INFLUENCE OF SELECTION OF SAMPLE AND ACCURACY OF 
OBSERVATIONS ON CORRELATION RESULTS 

Selection of Sample 

Methods of determining linear and curvilinear regressions, together 
with appropriate measures of their significance and accuracy, have 
been set forth in previous chapters. These methods do not yield results 
representative of the universe from which the sample observations have 
been drawn, however, if that sample is not truly representative of the 
particular relation being determined. There are various ways in which 
the sample may fail to represent the universe, and the resulting extent 
to which the correlation constants will be biased will vary both with the 
character of the unrepresentativeness and with the individual coeffi- 
cients. Each type of abnormality must therefore be treated separately. 

The samples may be selected from the universe in such a way as 
to exclude all the observations falling beyond a certain value of a 
given variable, thus ruling out values either at one or at both extremes,
or perhaps ruling out middle values and selecting only extreme ones. 
This may be done for either the dependent variable or the independent 
variable or variables, or for both together. Such a selection of observa- 
tions produces certain specific effects upon the correlation constants. 
Under some conditions it may be very desirable to select the observa- 
tions in this way, if only the resulting aberrations in the correlation 
constants are recognized and allowed for.

A second and somewhat more difficult type of problem to deal with 
arises when there are errors of measurement in obtaining the values 
of one or more of the variables — such errors as might arise, for example, 
in estimating the total production of corn in the United States within 
a given year, or in working out, from a farmer’s memory, what was the 
income on his farm the previous year. Here again the effect on the 
correlation constants will depend upon whether the errors are random 
or biased and upon which variable or variables are affected by the 
errors. A separate discussion must therefore be given each case. 

The clearest way of indicating the effect of these various departures 
from truly representative sampling may be by first stating the general 



principles involved, and then illustrating the way those principles work 
out by concrete illustrations. Except where specially stated otherwise, 
the discussion will apply solely to linear relations. The effects for 
curvilinear relations are in general analogous to those to be discussed. 

Selection of sample with respect to values of independent variable. 
If the sample is selected with respect to values of the independent 
variable, that will not tend to affect the slope of the regression line but 
will affect the value of the coefficient of correlation. If the selection 
is such that extreme values are rejected but intermediate ones are left 
in, the correlation will be lowered below that prevailing in the universe; 
if intermediate values are rejected and only extreme ones are used, the 
correlation will be raised above that prevailing in the universe. If the 
values of both variables are normally distributed, the standard error 
of estimate will tend to remain the same, regardless of the selection. 

These principles may be illustrated by the set of hypothetical data 
shown in Table 76. 

TABLE 76

Correlation Table, Showing Hypothetical Frequencies at Specified Values

                          Values of X
Values of Y       0      1      2      3      4

     0                          1      1      1
     1                   1      2      2      1
     2            1      2      4      2      1
     3            1      2      2      1
     4            1      1      1




For the data shown, $r = -0.47$, $\sigma_x = 1.134$, and $b_{yx} = -0.50$.

If now the values had been selected with reference to X so as to
exclude values below 1 or above 3, the number of observations would
have been reduced from 28 to 22. For this restricted set of
observations, $r = -0.26$, $\sigma_x = 0.739$, $\sigma_y = 1.09$, but $b_{yx} = -0.50$.
Computing the standard error of estimate, we arrive at $S_{y.x} = 1.02$ for the first
case and 1.07 for the second. It is quite apparent that the correlation
has been lowered by the restriction in the selection of the values of X;
but the regression of y on x has not been changed at all, and the 
standard deviation of the residuals has been only slightly changed. 

If now the selection is such that only extreme values of X are taken,
say below 1 and above 3, the number of observations is reduced to 6. 





Computing the results for those values, we have $\sigma_x = 2.00$, $\sigma_y = 1.29$,
$r = -0.71$, but $b_{yx} = -0.50$! Also $S_{y.x} = 1.00$.

Bringing the three sets of results together for comparison, we have 
the following tabulation. 

With X used as independent variable:

                                      $\sigma_x$   $\sigma_y$   $r_{xy}$   $S_{y.x}$   $b_{yx}$

All cases                                1.13         1.13        -0.47       1.02       -0.50
Extreme values of X excluded             0.74         1.09        -0.26       1.07       -0.50
Only extreme values of X used            2.00         1.29        -0.71       1.00       -0.50


These three examples thus illustrate the principles stated before: 
that selection with respect to the independent factor does not tend to 
change the regression or the standard deviation of the residuals but 
does affect the correlation, lowering the correlation if it has lowered 
the dispersion of the independent factor and raising the correlation if
it has increased the dispersion of that factor. 
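The three sets of constants can be checked directly from the frequencies of Table 76. The sketch below is an illustration in Python with NumPy, not part of the original computation; the function name `constants` and the dictionary keys are invented. It assumes that the printed coefficient of correlation and standard error of estimate are the values adjusted for the two constants fitted ($n - 2$ degrees of freedom), since that assumption reproduces the printed figures, whereas the unadjusted correlation for all 28 cases works out to $-0.50$.

```python
import numpy as np

# Frequencies of Table 76: FREQ[y][x] is the number of cases at that (Y, X) pair.
FREQ = np.array([
    [0, 0, 1, 1, 1],   # Y = 0
    [0, 1, 2, 2, 1],   # Y = 1
    [1, 2, 4, 2, 1],   # Y = 2
    [1, 2, 2, 1, 0],   # Y = 3
    [1, 1, 1, 0, 0],   # Y = 4
])

def constants(keep_x):
    """Correlation constants using only the cases whose X value is in keep_x."""
    y_idx, x_idx = np.nonzero(FREQ)
    counts = FREQ[y_idx, x_idx]
    x = np.repeat(x_idx, counts).astype(float)
    y = np.repeat(y_idx, counts).astype(float)
    mask = np.isin(x, keep_x)          # selection with respect to X only
    x, y = x[mask], y[mask]
    n = len(x)
    dx, dy = x - x.mean(), y - y.mean()
    sxx, syy, sxy = np.sum(dx ** 2), np.sum(dy ** 2), np.sum(dx * dy)
    b_yx = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)
    # Assumed adjustment for the two fitted constants (n - 2 degrees of freedom).
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - 2)
    return {
        "n": n,
        "sigma_x": float(np.sqrt(sxx / n)),
        "sigma_y": float(np.sqrt(syy / n)),
        "r_adj": float(np.sign(b_yx) * np.sqrt(max(r2_adj, 0.0))),
        "s_yx": float(np.sqrt(syy * (1 - r2) / (n - 2))),
        "b_yx": float(b_yx),
    }

all_cases = constants([0, 1, 2, 3, 4])
middle_only = constants([1, 2, 3])
extremes_only = constants([0, 4])
```

In all three selections `b_yx` comes out at $-0.50$, while the adjusted correlation moves from about $-0.47$ to about $-0.27$ and $-0.71$, which is the behavior the tabulation illustrates.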

Selection of sample with respect to values of dependent factor. 
Selection with respect to values of the dependent factor is more serious, 
in that it affects all the constants. According as the effect is to raise 
or to lower the standard deviation of the dependent factor, such selec- 
tion tends to raise or lower both the regression coefficient and the 
coefficient of correlation from the value for the universe and likewise 
to raise or lower, respectively, the standard error of estimate. 

These principles may be illustrated from the three examples just 
used, by regarding X as the dependent factor and Y as the independent 
factor and noting the influence of the selection with regard to the 
dependent factor, X, upon the regression of X on Y, $b_{xy}$. For the first
case, with all values left in, $b_{xy} = -0.50$ and $S_{x.y} = 1.02$. For the
second case, however, with extreme values of X left out, $b_{xy}$ drops to
$-0.23$ and $S_{x.y}$ becomes 0.73. For the third case, with only extreme
values of X included, $b_{xy}$ increases to $-1.20$ and $S_{x.y}$ becomes 1.55.
Bringing these three sets together yields the following comparison. 

With X used as dependent variable:

                                      $\sigma_x$   $\sigma_y$   $r_{xy}$   $S_{x.y}$   $b_{xy}$

All cases                                1.13         1.13        -0.47       1.02       -0.50
Extreme values of X excluded             0.74         1.09        -0.26       0.73       -0.23
Only extreme values of X included        2.00         1.29        -0.71       1.55       -1.20





These results indicate the extent to which selection with regard 
to the dependent factor may completely destroy the significance of 
all the results. 
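The reason the two regressions behave so differently can be seen from the sums of squares and products about the means. The figures below were worked out from the frequencies of Table 76 for this illustration; the symbols $\sum x^2$, $\sum y^2$, and $\sum xy$ are shorthand for those sums and are not notation taken from the tabulations above.

```latex
\begin{array}{lrrrrr}
 & \sum x^2 & \sum y^2 & \sum xy
 & b_{yx} = \dfrac{\sum xy}{\sum x^2}
 & b_{xy} = \dfrac{\sum xy}{\sum y^2} \\[6pt]
\text{All cases } (n = 28)                              & 36 & 36 & -18 & -0.50 & -0.50 \\
\text{Extreme values of } X \text{ excluded } (n = 22)  & 12 & 26 &  -6 & -0.50 & -0.23 \\
\text{Only extreme values of } X \text{ included } (n = 6) & 24 & 10 & -12 & -0.50 & -1.20
\end{array}
```

Selecting on X changes $\sum xy$ and $\sum x^2$ in the same proportion, so their ratio $b_{yx}$ is undisturbed. It changes $\sum y^2$ in a different proportion, so the ratio $b_{xy}$, in which X is the dependent factor, is thrown off.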

Selection of samples with reference to values of both variables.

Selection of cases with reference to values of both independent and 
dependent variables has an even greater effect upon the conclusions 
than the two cases discussed because selection of extreme values tends 
to exaggerate the correlation and regression, and of central values to 
lower both, to even greater extent than where the selection is with 
respect to the dependent factor alone. 

If, in the data of Table 76, only those cases are selected in which 
values of X below 2 are associated with values of Y above 2, and in
which values of X above 2 are associated with values of Y below 2, 
the observations are reduced to ten cases, as follows: 


Values     Values     Number          Values     Values     Number
 of X       of Y      of cases         of X       of Y      of cases

   0          3          1               3          0          1
   0          4          1               3          1          2
   1          3          2               4          0          1
   1          4          1               4          1          1


For these values, $\sigma_x = 1.48$, $\sigma_y = 1.48$, $r_{xy} = -0.90$, $b_{yx} = -0.91$,
and $S_{y.x} = 0.68$.

It is evident that such selection raises both the correlation and the 
regression above the true value for the universe. This is to be expected, 
for this selection is equivalent to picking out the pairs of values which 
do show correlation with each other. Restricting the selection to paired 
values of above 1 for both variables, and below 3 for both variables, 
likewise would be picking out cases so as to eliminate all correlation. 
Such selection obviously destroys the value of the results. 
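For the ten selected cases the arithmetic may be set out as follows. The sums about the means were computed from the ten cases listed for this illustration, and the adjustment of the correlation for $n - 2$ degrees of freedom is an assumption, adopted because it agrees with the printed value.

```latex
\begin{aligned}
\sum x^2 &= 22, \qquad \sum y^2 = 22, \qquad \sum xy = -20, \qquad n = 10, \\
b_{yx} &= \frac{\sum xy}{\sum x^2} = \frac{-20}{22} = -0.91, \\
r^2 &= \frac{(\sum xy)^2}{\sum x^2 \sum y^2} = \frac{400}{484} = 0.826, \\
\bar{r}^2 &= 1 - (1 - r^2)\,\frac{n - 1}{n - 2} = 1 - (0.174)\,\frac{9}{8} = 0.805,
\qquad \bar{r} = -0.90 .
\end{aligned}
```

The regression of $-0.91$ compares with $-0.50$ for all 28 cases, although nothing about the underlying relation has changed; only the cases admitted to the sample differ.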

Conclusions with reference to selection of data. If an investigator 
is interested only in the regression line and not in the degree of correla- 
tion, and if the regression is truly linear, selection of data with refer- 
ence to the independent factor (or factors) will not tend to change the 
slope of the regression line (or lines). Under those conditions selection 
of extreme cases of the independent factor may yield a reliable indica- 
tion of the regression with much fewer observations than if the cases 
were selected at random. This principle is frequently applied in experi- 





mental or laboratory work, but is equally applicable in other types of 
investigations. 

If the regressions are curvilinear, however, special selection of 
either extreme or central items of the independent variables forestalls 
the determination of the nature of the function, since curvilinear regres- 
sions can be determined only for the ranges of the independent factor 
within which observations have been secured. For such regressions, 
therefore, the nature of the function may be more accurately
determined if the independent items are selected so as to be spread fairly
uniformly through the whole range of values, thus affording a
sufficient number of observations for accurate determination of the nature
of the relation throughout the whole range. Selection purely at random 
frequently provides more observations than are needed for certain 
portions of the curve, and provides so thin a scattering of observations 
at other portions as to make its true position and shape quite indeter- 
minate, as has been illustrated previously. Even if curvilinearity is 
only suspected, such a uniform distribution of values for the
independent variable provides an improved basis for determining whether
or not the regression is truly linear, as compared with an equal number 
of observations selected at random. At the same time, where the 
dependent factor is normally distributed, selection with reference to 
the independent factor does not tend to change the standard error of 
estimate. 

If the primary interest, however, is not in the nature of the relations 
and in determining how closely values of the dependent factor may be 
estimated (regressions and standard error), but instead is in
determining what proportion of the original variation in the dependent factor
can be accounted for on the basis of the relations determined
(correlation and determination), then anything other than random selection
with reference to any factor will give estimates of the closeness of the
correlation which either over- or underestimate the true correlation
in the universe from which the sample is drawn. For most accurate
results in such problems, the distribution of the dependent factor in
the sample should be an accurate representation of the distribution in
the universe from which the observations were drawn, and the only
selection which would be justified would be aimed at securing such a
sample. 

Since the correlation coefficient or index, and the parallel measures
of determination, are of significance only with respect to the standard
deviation of the observed values of the dependent factor, it follows 





that when the dependent factor has such an abnormal distribution that 
its standard deviation is of little value as a descriptive statistic, the 
measures of correlation also tend to be of little value. For any series 
which actually yielded such an extreme distribution as the dichotomous 
values used in the third case of those just illustrated, measures of 
correlation would have little significance except their formal mathe- 
matical definition. Yet the regressions and standard errors of estimate 
would tend to retain all their usual value and significance, so long as 
no selection had been made with reference to values of the dependent 
variable. In such a case, attempting to select the values of the depen- 
dent factor so as to make the series more nearly normal might seriously 
bias the regression results. 

Accuracy of Observations 

The data with which the statistician has to deal are frequently 
subject to errors of observation. If corn yields are being studied in 
relation to fertilizer applications, for example, farmers may be able 
to estimate the yield per acre on a given tract only to within 5 or 10 
bushels of the true yield. If livestock prices are being studied, the 
market reporter may not be able to get his daily average nearer than 
within 10 or 25 cents per 100 pounds of the true average of all the
sales for the day. Or if educational ratings are being studied, the 
instructor may not be able to grade the test papers nearer than to 
within 5 or 10 per cent of the grade each really deserves. All these 
illustrations are akin to the difficulties of the surveyor, who finds he 
cannot measure his angles more accurately than within a certain 
number of seconds; or of the astronomer, who finds his repeated
observations disagree from each other by fractions of a second. But the
errors of measurement are ordinarily tremendously greater in bio- 
logical, economic, or social investigations than in physical observations; 
and for that reason statisticians must be particularly careful to use 
their data in such a way as to minimize the influence, upon their con- 
clusions, of the errors which may be present. 

Errors of observation may be such that they are not correlated 
with the value being observed, and hence tend to fall equally above 
and below the true values throughout the range of the variable; or 
else they may be such that they are correlated with the variable, tend- 
ing usually to make the observed value fall above the true value in 
the upper part of the range, and below the true value in the lower part; 
or vice versa. 





In correlation problems, there are two sets of true values involved, 
those for the dependent and independent variables; and there may also
be two sets of errors, one tending to cause the observed values for the 
dependent variable to differ from the true values, and the other affect- 
ing the independent variable. The extent to which such errors, if 
present, modify or impair the results of correlation analysis depends 
both upon the type of the errors and the variables which they affect. 

If the errors affect only values of the dependent factor, and if they 
are not correlated with the true values, their presence tends to lower 
the correlation and to increase the standard error of estimate, but does 
not tend to change the slope of the regression line from the true slope 
for the universe. If, however, uncorrelated errors are in the inde- 
pendent factor, that not only tends to lower the correlation and increase 
the standard error of estimate, but also tends to decrease the regression 
below the true value. Both of these cases may be illustrated from the 
same set of data used before. 
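The two statements may be put in symbols. The derivation below is a sketch under stated assumptions, not a formula quoted from this chapter: $u$ and $v$ denote errors added to the true $X$ and $Y$, they are taken to be uncorrelated with the true values and with each other, and the expressions hold for the universe (or for large samples) rather than exactly in any one sample.

```latex
\begin{aligned}
X' &= X + u, \qquad Y' = Y + v, \\[4pt]
b_{y'x'} &= \frac{\operatorname{cov}(X', Y')}{\operatorname{var}(X')}
          = \frac{\sigma_{xy}}{\sigma_x^2 + \sigma_u^2}
          = b_{yx}\,\frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2}, \\[4pt]
r_{x'y'} &= \frac{\sigma_{xy}}{\sqrt{(\sigma_x^2 + \sigma_u^2)(\sigma_y^2 + \sigma_v^2)}} .
\end{aligned}
```

The error $v$ in the dependent factor enters only the correlation, through $\sigma_v^2$, and leaves the regression alone. The error $u$ in the independent factor enters the denominator of the regression as well, which is why the regression is reduced toward zero.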

Errors in the dependent variable. The data used in Table 76 may 
be modified by assuming some random error influences Y, making 
one-third of the values 1 unit higher, one-third 1 unit lower, and leaving 
one-third unchanged. With these changes, the data appear as follows: 


TABLE 77

Correlation Table, Showing Hypothetical Frequencies at Specified
Values, with Random Errors in Y


Values of Y

Values of X

0

1

2

3

4

-1 

0 

1 

2 



1 


1 


1 

2 


1 

3 


1 

2 


2 

3 

3 

4 

1 

2 


3 

1 

1 

1 

5 

1 

1 









For these data, $r_{yx} = -0.33$, $b_{yx} = -0.50$, and $S_{y.x} = 1.46$. The
introduction of the random error into Y has lowered the correlation 
from that of —0.47 for the original values and increased the standard 
error of estimate; but it has had no significant effect upon the regression 





of Y on X, the new value —0.50 being identical with the value of —0.50 
for the original data in Table 76. 

Error in the independent variable. If, however, X is regarded as 
the dependent factor and Y as the independent, the regression
coefficient for the new values, $b_{xy} = -0.28$, is found to be much reduced
from that of —0.50 for the original values. Introducing even random 
errors into the observations of the independent factor markedly reduces 
the observed regressions below the true value. 
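Both effects can be seen in a small sampling experiment. The sketch below is an illustration only: the data are invented (a universe with a true regression of $-0.50$), the sample is made large so that chance fluctuations are small, and the errors are $-1$, $0$, or $+1$ with equal chances, in the manner of those introduced into Table 76.

```python
import numpy as np

def slope(x, y):
    """Least-squares regression coefficient of y on x."""
    dx = x - x.mean()
    return float(np.sum(dx * (y - y.mean())) / np.sum(dx ** 2))

rng = np.random.default_rng(42)
n = 20000
x_true = rng.normal(0.0, 1.0, size=n)
y_true = -0.5 * x_true + rng.normal(0.0, 1.0, size=n)

# Random errors of -1, 0, or +1, each with probability one-third (variance 2/3).
err_y = rng.integers(-1, 2, size=n)
err_x = rng.integers(-1, 2, size=n)

b_clean = slope(x_true, y_true)
b_err_in_y = slope(x_true, y_true + err_y)   # errors in the dependent factor only
b_err_in_x = slope(x_true + err_x, y_true)   # errors in the independent factor only
```

With errors in the dependent factor the coefficient stays near $-0.50$. With errors in the independent factor it falls to roughly $-0.30$, which is $-0.50$ multiplied by the ratio of the true variance of the independent factor (1) to its observed variance (1 plus 2/3).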

The errors considered to this point have all been random errors. 
If, instead, the errors are correlated with either of the factors, their 
presence would obscure the true relationship and bias any correlation 
constants which might be computed, tending to make them either too 
high or too low, depending on the inter-relations between the errors
and the variables. 

Errors in both variables. If random errors are associated with 
both variables simultaneously, their effects are a blending of those 
just illustrated, tending to reduce both the closeness of correlation and 
the regression below the true values. For example, if random errors 
of the same magnitude are introduced into X as well as Y of Table 76, 
the values appear as follows: 

TABLE 78 


Correlation Table, Showing Hypothetical Frequencies at Specified
Values, with Random Errors in Both X and Y



With these changes, the correlation is reduced to practically 0, 
the standard error increased to 1.524, and the regression of Y on X
changed to —0.179. The comparison of these constants with those 
for the original data in Table 76 illustrates the extent to which the 
presence of random errors in the observed values of the variables may 
reduce the accuracy and effectiveness of correlation analysis. 




Dealing with errors in both variables. The methods of computing 
the regression line considered to this point are methods which take one 
variable as given, or independent, and the other variable as based upon 
it, or dependent. If it is known that all the errors of observation are 
random and are in one variable, and none are in the other, the effect 
of those errors may best be eliminated by considering the one with no 
errors as independent and the other as dependent. As has just been 
demonstrated, the regression line then obtained will be practically 
identical with that which would be obtained if no random errors at 
all were present. 

In some cases it may be known that both variables are subject to 
random error, yet it may be desired to obtain a regression line which 
most accurately expresses the relation between the two. That can be 
done by a special method, which fits the line on the condition that the 
sum of the squares of the departures of each observation perpendicular 
to the fitted line shall be made a minimum (in contrast to the usual 
condition that the sum of the squares of the vertical departures from 
the fitted line shall be made a minimum, with the dependent variable 
plotted as the ordinate.) This special method involves an entirely 
different procedure for fitting the line, and is not given here. It has 
the disadvantage that it does not give a basis for estimating values for 
either variable from known values of the other, nor does it give a 
basis for measuring the closeness of the correlation between the two. 
It is referred to here merely to call attention to the fact that methods 
are available for determining the regression when both variables are 
known to be subject to random errors.¹
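As a rough illustration of the perpendicular-departure condition just described, the sketch below fits a line by the usual closed-form slope for minimizing squared perpendicular distances. It is not the procedure of the reference cited and is not a method given in this book; it assumes the errors in the two variables are of about equal size, and the function names and sample points are hypothetical.

```python
import math

def perpendicular_fit(xs, ys):
    """Slope and intercept of the line minimizing the sum of squared
    perpendicular departures (assumes equal error variance in x and y)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxy == 0:
        raise ValueError("no linear association; slope is undefined")
    diff = syy - sxx
    slope = (diff + math.sqrt(diff ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    return slope, my - slope * mx

def vertical_fit(xs, ys):
    """Ordinary least-squares line (vertical departures), for comparison."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical observations, both variables measured with some error.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.4, 3.8, 5.3, 5.9]

print("perpendicular fit:", perpendicular_fit(xs, ys))
print("vertical fit:     ", vertical_fit(xs, ys))
```

Unlike the ordinary regression, the perpendicular fit treats the two variables alike: interchanging them gives the reciprocal of the slope, which is one way to see that it describes the relation between the two rather than an estimate of one from the other.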

Errors of observation in multiple correlations. The points which 
have been illustrated here for simple correlation are equally true for 
multiple correlation, both with respect to the influence of selection of 
sample and of the effect of errors of observation. The influence of 
errors of observation in multiple correlation problems may be illus- 
trated by a case based on actual economic data. 

Over the 17 years from 1907 through 1923, the monthly price of 
lambs shows a very high correlation with the price of wool and the 
price of dressed lamb. When X2 is used for the price of wool, in cents
per pound, X3 for prices of dressed lamb, in cents per pound, and X1
for prices of live lambs, in cents per pound, multiple correlation gives,
for the 204 observations, $R_{1.23} = 0.991$ and $x_1 = 0.144x_2 + 0.354x_3$.

To test what effect random errors would have had on this correlation,
two dice were thrown 204 times, giving random values from 2 to
12. These values were then added to the successive values of the
dependent, and a similar set of 204 values to the successive observations
of one independent factor, to see what effect that would have on the
results. In the following tabulation the notation “+ e” is used to
designate the variables to whose values these “random errors” had been
added.

¹ Abraham Wald, Fitting of straight lines if both variables are subject to error,
Annals of Mathematical Statistics, Vol. XI, No. 3, pp. 284-299, September, 1940.

Effect of Introducing Random Errors on Correlation Results

Independent      Dependent    Multiple       Regression equation
variables        variable     correlation

X2 and X3        X1           0.991          0.144x2 + 0.354x3
X2 and X3        X1 + e       0.821          0.112x2 + 0.424x3
X2 and X3 + e    X1           0.953          0.163x2 + 0.277x3
X2 and X3 + e    X1 + e       0.804          0.152x2 + 0.306x3

These results illustrate the principles just set forth. The introduction
of random errors into the dependent variable (X1) reduces the
correlation, but does not greatly change the size of the two regression
coefficients. It would appear, especially from the amount of the reduction
in net regression on X2, that the errors in this case may not have
been completely randomly distributed and uncorrelated with X1, X2,
and X3, even though determined by throws of dice.

But the second modification, where the error is introduced into the
independent variable X3 instead, is much more striking. The correlation
is not reduced so much as in the first case, and the regression of
X1 on X2 is changed only slightly from the original value, and increased
as it happens. The net regression of X1 on X3 + e, however, is only
three-fourths as large as was the net regression of X1 on X3, in spite of
the fact that the error introduced was only enough to raise the standard
deviation of X3 from 6.14 to 6.64.

The final case, with errors introduced into both X1 and X3, shows
the lowest correlation of any, as would be expected. The net regression
of X1 + e on X2 is but little different from what the regression of X1
on X2 was, whereas the net regression of X1 + e on X3 + e, though
larger than what the regression of X1 on X3 + e was, is still definitely
lower than the regression of X1 on X3. The regression equation in this
last case, where X1 + e is the dependent, is not greatly different from
what it was in the preceding case with X1 as the dependent, in spite
of the fact that one of the independent variables, X3, had a significant
random error of observation in its values both times.

These cases illustrate the extent to which random errors may con- 
fuse the true relations, if they are allowed to creep into the observa- 
tions. Just how great an effect upon the results such random errors 
will have depends upon the magnitude of the errors, the original varia- 
tions in the variables, and the closeness of the inter-correlation. 
Although equations can be derived to show how great a reduction in 
correlation errors of a given magnitude will produce, they are of little 
practical use in economic work, since it is usually difficult enough to 
determine whether there are errors of observation or not, much less to 
determine what magnitude they have.² In using reports or estimates
of prices or commodity production or supply, we know that the data 
are nearly always subject to more or less error. The same is true of 
many other economic data — errors of observation of greater or less 
magnitude are nearly always present. It may be of some slight reas- 
surance to know that observational errors even as large as those 
introduced in the example just considered still modify the regression 
results as little as these have been seen to do. 

The practical significance of the principles which are stated here 
is that, if there is known to be a large but random error in observing 
some variable, that variable may still be used as the dependent variable 
in a correlation study without making the regressions or estimating 
equation very far wrong, if determined with a large number of cases; 
but, on the other hand, any use of that variable as an independent 
variable will be certain to yield results which understate the actual 
relations. 
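The principle can be tried out with a small dice experiment like the one described above. The following is only a sketch under stated assumptions: the "true" relation, the sample size, the seed, and the function names are all hypothetical, and the data are simulated rather than the lamb-price series.

```python
import random

def slope_and_r(xs, ys):
    """Least-squares slope of ys on xs, and the correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / (sxx * syy) ** 0.5

def two_dice(rng):
    """One throw of two dice: a random value from 2 to 12."""
    return rng.randint(1, 6) + rng.randint(1, 6)

rng = random.Random(1941)   # fixed seed so the run can be repeated
n = 2000                    # hypothetical number of observations

# Hypothetical error-free values with an exact relation of slope 0.5.
true_x = [rng.gauss(20.0, 6.0) for _ in range(n)]
true_y = [2.0 + 0.5 * x for x in true_x]

# Random errors added to the dependent only, then to the independent only.
y_obs = [y + two_dice(rng) for y in true_y]
x_obs = [x + two_dice(rng) for x in true_x]

b_clean, r_clean = slope_and_r(true_x, true_y)
b_dep, r_dep = slope_and_r(true_x, y_obs)
b_ind, r_ind = slope_and_r(x_obs, true_y)

print(f"no errors:             slope {b_clean:.3f}, r {r_clean:.3f}")
print(f"errors in dependent:   slope {b_dep:.3f}, r {r_dep:.3f}")
print(f"errors in independent: slope {b_ind:.3f}, r {r_ind:.3f}")
```

With errors in the dependent variable the slope should stay near 0.5 while the correlation falls; with errors in the independent variable the slope itself is pulled toward zero, which is the understatement referred to above.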

² In the problem given, the significant values determining the effect of the
errors are:

$\sigma_1 = 3.96 \qquad \sigma_{1+e} = 4.74$

$\sigma_3 = 6.14 \qquad \sigma_{3+e} = 6.64$

If the errors are in the dependent variable alone, the relations between the true
and the apparent correlation are indicated by the equation:

$$R^2_{(1+e).23 \ldots n} = R^2_{1.23 \ldots n} \, \frac{\sigma_1^2}{\sigma_1^2 + \sigma_e^2}$$

This gives what the new correlation would be if the errors were truly random,
so that the new regression equation came out as identical with the old. In the
problem given, this gives an expected value for R of 0.827 as compared to the 0.821
actually obtained.





In cases where the errors are biased, they tend to make the results 
of correlation analysis more or less in error, quite regardless of the 
variables to which they apply. If the errors tend either to magnify 
or to minimize the differences which actually exist, they will have a 
parallel effect on the regression coefficients if they apply to the
dependent variable and an inverse effect if they apply to an independent
variable. There are so many different types of bias, however, that no 
more definite statement of the effects can be laid down. 

Random errors have the same type of effect in the case of curvi- 
linear correlation that they do in linear correlation, since if they are 
truly random they will tend to be balanced out along all the portions 
of the regression curve alike, if in the dependent variable; or tend to
confuse the relations along the curve, if in the independent variable; 
and so reduce the differences observed. 

Biased errors, on the contrary, may happen to be concentrated 
along certain portions of the range, and hence have a much more 
marked effect at one point than at another. Although this might 
seriously disturb the significance of the curve, it probably would have 
an equally disastrous effect on the reliability of the straight line. 
About the only real difference between linearity and curvilinearity 
with regard to errors is that random errors in the dependent variable 
could be “balanced out” in the case of a straight-line regression with
a somewhat smaller number of observations than would be necessary 
to secure valid results for a curvilinear regression. 

Where, with random errors in the dependent factor, there are not 
enough cases available to “balance them out,” the effect of the errors
is to throw a varying amount of error into the conclusions, the exact 
amount of the error depending on how closely the errors approach being 
canceled out. The illustrative case, where with over 200 observations 
the regressions were still changed somewhat, probably indicates what 
may be obtained by a combination of slight departures from true 
“randomness” in the errors with a sample not quite large enough to
eliminate entirely all the resulting instability. This may be nearer to 
what would usually happen in practice than the theoretical complete 
elimination of the errors in the dependent variable. 

Summary. Modification of the observations from the true condi- 
tions, either by selection of the sample or by the presence of errors 
of observation, tends to alter the value of the coefficient of correlation. 
If the regression line or curve is of primary interest, however, its 
accuracy of determination may be increased by suitable selection of 
observations with respect to independent factors. Similarly, random
errors of observation may not influence the regressions, if the factor
they affect can be treated as the dependent factor and if enough 
observations are available to balance out the errors. These points hold 
true for multiple correlation problems as well as for 2-variable 
problems. 



CHAPTER 21 


MEASURING THE RELATION BETWEEN ONE VARIABLE AND 
TWO OR MORE OTHERS OPERATING JOINTLY 

In working out the change in one variable with changes in other 
variables up to this point we have assumed that the relation of the 
dependent factor to each independent factor did not change, no matter 
what combination of other independent factors was present. In the 
case of the yield of corn, for example, as worked out in Chapter 14, 
we assumed that the effect of a given change in rainfall upon the yield 
was the same, no matter what was the temperature for the season. 
The significance of this assumption may be shown by combining the 
estimate for rainfall with the estimate for temperature, and plotting 
the combined influence of the two variables. In Table 68 (on page 252) 
we already have this combined influence worked out, so all we have 
to do is to plot it. Figure 62 shows the resulting figure. In reading 
this figure it should be noted that the inches of rainfall are read along 
the right-hand edge of the bottom of the cube, the degrees of tempera- 
ture along the left-hand edge, and the yield along the vertical edge. 
The yield for any combination of temperature and rainfall is then 
shown by the distance the upper surface of the solid figure is above 
the point of intersection of the corresponding values in the base plane.¹

Inspecting Figure 62, we can now see what is meant by saying that
the changes in yield are assumed to be the same for each change in
rainfall, no matter what the temperature. As shown in the figure, the
maximum yield with a temperature of 70° is obtained at about 12
inches of rain, and that is also the rainfall which produces a maximum
yield with a temperature of 72°, 74°, or 78°. Each curve has just
exactly the same shape, and the only difference is their elevation above
the base. On looking at it the other way, we find that the same is true
of temperature. With 9 inches of rainfall the maximum yield is obtained
at about 75° temperature, and the maximum is also at 75° with
other levels of rainfall. This relation necessarily follows the assumptions
made in measuring it. Figure 62 merely shows the estimate we
get by the use of equation (54):

$$X_1 = a + f_2(X_2) + f_3(X_3)$$

Fig. 62. Yield of corn for various specific combinations of rainfall and
temperature, from multiple curvilinear correlation.

¹ The way this figure is made may be thought of as follows: Suppose we drew a
series of charts of the estimated differences in yield with differences in rainfall, with
one chart for an average temperature of 70°, one for 72°, one for 74°, etc. Then if
we cut these charts off at the yield line, and arrange them one back of the other,
at even distances, we have a figure looking much like Figure 62. The lines sloping
across the surface from left to right represent what would be the tops of this series
of charts. (In this figure the estimates are charted for all combinations of the two
variables, even for some not represented in the sample and not shown in Table 68.)


In working out those estimates we simply add together the estimated
value for X2 and the estimated value for X3. It does not make
any difference what the value of X3 is, the changes in X1 assumed to
accompany particular changes in X2 are the same, and that is what
the figure shows.
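The consequence of simply adding the two estimates can be checked numerically. In the sketch below the two curves and the constant are hypothetical stand-ins (simple parabolas peaking at 12 inches and 75°, chosen only to echo the figure), not the curves actually fitted in Chapter 14.

```python
# Hypothetical net regression curves; only their additive combination matters.
def f_rain(inches):
    return -0.5 * (inches - 12.0) ** 2

def f_temp(degrees):
    return -0.3 * (degrees - 75.0) ** 2

A = 40.0  # hypothetical constant term

def additive_yield(inches, degrees):
    """Estimate of the form X1 = a + f2(X2) + f3(X3)."""
    return A + f_rain(inches) + f_temp(degrees)

rain_grid = range(6, 19)  # inches of rainfall examined

# For each temperature, find the rainfall giving the highest estimate.
best_rain = {
    temp: max(rain_grid, key=lambda r: additive_yield(r, temp))
    for temp in (70, 72, 74, 78)
}
print(best_rain)  # {70: 12, 72: 12, 74: 12, 78: 12}
```

Whatever curves are used, the rainfall giving the highest estimate is the same at every temperature, because the temperature term only raises or lowers the whole rainfall curve.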

Only a little reflection is needed to indicate that Figure 62 may
not tell the whole truth of the relation of yield to rainfall and
temperature. It is quite possible that the crop can use more rain in a
hot season than in a cool one, so that the rainfall which will produce
the maximum crop may be higher in a season of high average
temperature than in a season of low temperature. If that is really the
case, equation (54) is unable to express the relationship, for, as just
pointed out, that equation assumes that the change in yield with
rainfall is the same, no matter what the temperature.

An extreme illustration of a changing relationship is shown in
Figure 63. This figure, which is based on actuarial investigations,²
shows the differences in mortality among men from the usual rate,
for differences in weight at different ages. Taking the 22-year line,
for example, we see that men who are much over normal weight
have a much higher mortality than normal for that age. Then as the
weight is less the mortality is less, until at normal weight there is only
normal mortality. But as the weight drops still more, the mortality
increases again, until below 80 per cent of the normal weight the
mortality is more than 20 per cent in excess of normal.

Fig. 63. Differences in mortality with differences in weight, for men of various
ages. (Each in percentage of average mortality for that age.) Illustration taken
from an article by Andrew Court.

The relation is somewhat different for 52-year-old men, however.
For them the mortality is also higher for those who are above normal
in weight and decreases as normal weight is reached. But as the
weight falls below normal the mortality continues to decrease, until
for men who are only 70 per cent of normal weight, the mortality is
more than 15 per cent below the normal for that age. For ages
intermediate between these two, the change is also intermediate; as is
shown in the chart, 27 years is similar to 22, but not so marked, and
the line for 47 years is similar to that for 52. At 42 years, there
is apparently little difference in mortality anywhere between 70
per cent of normal weight and 100 per cent.

² Medico-Actuarial Investigations, Vol. II, p. 24, 1913.



Fig. 64. Relation shown in previous figure, represented by equation
$X_1 = f_2(X_2) + f_3(X_3)$.

Figure 63 illustrates a situation which the previous methods of 
analysis would be quite incapable of dealing with adequately. Were 
equation (54) used to represent this relation, the higher mortality 
with lower weights for young men would tend to balance out the 
lower mortality for the older men at the same weight. In fact, the 
erroneous conclusion might be reached that the age does not affect
the mortality at all. Figure 64 shows the results of an attempt to 
represent this relation by the methods previously discussed. It is 
quite obvious that the results fall far short of the relations as shown 
by Figure 63. 
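How the opposing relations balance out can be seen in a toy calculation. The mortality figures below are hypothetical, chosen only to mimic the general pattern described for the youngest and oldest groups; they are not read from Figure 63. For a balanced two-way table, the additive estimate is the age average plus the weight average less the general average.

```python
ages = [22, 52]
weights = [80, 100, 120]  # per cent of normal weight

# Hypothetical mortality, per cent of average for the age (illustration only).
mortality = {
    (22, 80): 120.0, (22, 100): 100.0, (22, 120): 120.0,
    (52, 80): 85.0,  (52, 100): 100.0, (52, 120): 120.0,
}

grand = sum(mortality.values()) / len(mortality)
age_mean = {a: sum(mortality[(a, w)] for w in weights) / len(weights) for a in ages}
weight_mean = {w: sum(mortality[(a, w)] for a in ages) / len(ages) for w in weights}

def additive_estimate(age, weight):
    """Estimate that lets each factor act separately, as in equation (54)."""
    return age_mean[age] + weight_mean[weight] - grand

residual = {cell: value - additive_estimate(*cell) for cell, value in mortality.items()}

for cell in sorted(mortality):
    print(cell, mortality[cell], round(additive_estimate(*cell), 1), round(residual[cell], 1))
```

In this made-up table the average mortality at 80 per cent of normal weight comes out close to the general level, since the excess for the young men and the deficiency for the older men nearly cancel; the separate-effects estimate therefore misses both groups at that weight by a wide margin.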




Use of “joint functions” to show combined effects. What is needed
in both the corn-yield problem and the mortality problem is some way 
of determining what the yield, in the one case, or the mortality, in 
the other, is most likely to be for any given combination of the two 
independent variables. That is quite different from asking for the 
separate effect of each one. Obviously, a small change in one inde- 
pendent factor will be expected to be accompanied by only a small 
change in the dependent, so that all the estimated yields (or mor- 
talities) will be expected to lie along a continuous surface like that 
shown in Figure 62 or 63; but the surface will be free to warp or change
its shape in different portions like the surface shown in Figure 63, 
instead of being held rigidly to the same shape in each dimension, like 
the surfaces in Figure 62 or 64. Mathematically, such a changing 
relation between one variable and two or more others is known as a 
joint functional relation, and may be indicated by the equation: 


$$X_1 = f(X_2, X_3) \qquad (84)$$

This is read simply “X1 is a joint function of X2 and X3.”
That means only that, for any combination of values of X2 and X3,
there will be some particular value of X1. Equation (84) is therefore
capable of representing either a relation such as that shown in Figure 
62, or the more complex relation shown in Figure 63. 

The problem of determining the extent to which corn yield varies 
with the joint effect of temperature and rainfall may be said to be 
one of determining the functional relation of yield to the two other 
factors, according to the relation shown in equation (84).

Determining a joint function for two independent variables. 
Where only two independent variables are concerned, the joint func- 
tional relation may be determined quite simply, if a large enough 
number of observations is available. 

The process may be illustrated by data from a different problem, 
shown in Table 79. The observations are from a field study of hay- 
stack dimensions in the Great Plains area. Farmers in this area 
ordinarily sell their hay unbaled and in the stack. It is therefore 
necessary to estimate the quantity of hay each stack contains. Two 
measurements, which can be made readily with only a rope, are 
usually employed — the perimeter around the base of the stack and 
the “over,” or the distance from the ground on one side of the stack
over the center to the ground on the other. The observations shown
in Table 79 are all for round stacks. These stacks vary in height
and shape to some extent, however, so their volume cannot be computed
from the basal circumference by any simple mathematical rule,
such as for the volume of a hemisphere. The volumes shown in the
table are computed from careful surveying measurements of all the
dimensions of each stack, much more exact measurements than a
farmer would be able to make in practice. The problem is to establish
the average volume for specified circumferences and “overs,” so
the farmers may be able to use these two measurements, and also to
determine how much confidence can be placed in estimates of volume
based on these two factors.

The volume will tend to be some function of the basal area times
the height. The basal area is a function of the square of the basal
circumference; the “over” is a function of both the basal diameter and
the height, but attempts to separate the two have been unsuccessful.
It is obvious, however, that any attempt to represent the relation by a
regression equation of the type

$$\text{volume} = f(\text{circumference}) + f(\text{``over''})$$

will be unsatisfactory because of the multiplying nature of the relations,
that is,

$$\text{volume} = f(\text{circumference})(\text{over})$$

Such a relationship may be approached by use of the relation

$$\log \text{volume} = f(\log \text{circumference}) + f(\log \text{over})$$

Attempts to determine the relationship by this equation, however, have
not been fully successful. The shape of the stacks apparently shifts
with changes in size.

The haystack problem is evidently one where the relation may best 
be expressed by a joint function such as 

$$\text{volume} = f(\text{circumference}, \text{over})$$

Such a relation could be determined directly from the data by the 
methods which will presently be described. It is evident that the cor- 
relation surface would have a marked upward slope as the two dimen- 
sions increased together, even if the usual volume formulas applied. 
The work for this particular problem may be somewhat simplified by 
first stating each variable as a logarithm and then determining the 
joint relation according to the equation 


$$\log \text{volume} = f(\log \text{circumference}, \log \text{over})$$





TABLE 79

Data Taken from Nebraska Round Stacks Measured in 1927 and 1928†

Volume,    Circum-    “Over,”
in cubic   ference,   in feet    X2*      X3*      X1*      X1′      z
feet       in feet

2853.00    69.0       37.00      0.139    0.168    0.455    0.478    -0.023
2702.00    65.0       36.50      0.113    0.162    0.432    0.450    -0.018
3099.00    73.0       38.50      0.163    0.185    0.491    0.447     0.044
1306.00    62.5       26.50      0.096    0.023    0.116    0.143    -0.027
2294.00    70.0       35.00      0.145    0.144    0.361    0.436    -0.075
2725.00    68.0       36.60      0.133    0.162    0.435    0.421     0.014
3309.00    71.0       39.25      0.151    0.194    0.520    0.557    -0.037
2790.00    64.0       36.75      0.106    0.165    0.446    0.450    -0.004
2766.00    62.0       38.50      0.092    0.185    0.440    0.478    -0.038
5237.92    80.0       43.00      0.203    0.233    0.719    0.705     0.014
3149.82    67.0       37.60      0.126    0.175    0.498    0.490     0.008
5498.46    79.0       44.60      0.198    0.249    0.740    0.739     0.001
3397.83    66.0       38.00      0.120    0.180    0.531    0.641    -0.010
3007.56    62.0       36.80      0.092    0.166    0.478    0.486    -0.008
4574.29    79.0       41.10      0.198    0.214    0.660    0.596     0.064
6228.69    73.0       48.00      0.163    0.281    0.794    0.780     0.014
2318.64    63.0       30.20      0.099    0.080    0.365    0.265     0.100
3176.71    68.0       37.76      0.133    0.177    0.602    0.502     0
2362.31    70.0       32.50      0.145    0.112    0.371    0.363     0.008
2174.44    69.0       31.62      0.139    0.100    0.337    0.333     0.004
2694.72    73.0       34.60      0.163    0.138    0.431    0.433    -0.002
3333.53    70.0       37.25      0.145    0.171    0.523    0.500     0.023
4328.92    78.5       40.00      0.195    0.202    0.636    0.617     0.019
2115.04    67.0       31.25      0.126    0.095    0.325    0.317     0.008
2489.08    66.5       33.75      0.123    0.128    0.396    0.388     0.008
2296.65    64.5       32.38      0.110    0.110    0.361    0.338     0.023
3117.21    65.5       37.58      0.116    0.175    0.494    0.480     0.014
4088.36    74.0       40.33      0.169    0.206    0.612    0.602     0.010
4180.88    72.0       40.50      0.157    0.207    0.621    0.594     0.027
2318.19    63.0       33.00      0.099    0.119    0.365    0.346     0.019
1946.90    58.0       31.00      0.063    0.091    0.289    0.255     0.034
2479.89    61.0       36.50      0.086    0.162    0.394    0.423    -0.029
3174.80    73.0       37.00      0.163    0.168    0.502    0.506    -0.004
2151.54    64.0       33.00      0.106    0.119    0.333    0.353    -0.020
3475.68    73.0       39.50      0.163    0.197    0.541    0.576    -0.035
4393.08    71.0       42.00      0.151    0.223    0.643    0.624     0.019
2819.50    69.0       35.00      0.139    0.144    0.450    0.432     0.018
3703.49    70.0       38.50      0.146    0.185    0.569    0.530     0.039
2742.81    72.5       34.50      0.160    0.138    0.438    0.430     0.008

* X2 = log10 (circumference) - 1.700, stated to three decimal places.
  X3 = log10 (“over”) - 1.4, stated to three decimal places.
  X1 = log10 (volume) - 3.0, stated to three decimal places.
† Acknowledgment is due W. H. Hosterman, of the Bureau of Agricultural Economics, U. S.
Department of Agriculture, for the use of these data.



TABLE 79 (Continued)

Volume,    Circum-    “Over,”
in cubic   ference,   in feet    X2*      X3*      X1*      X1′      z
feet       in feet

3002.40    66.0       35.50      0.120    0.150    0.477    0.430
1854.19    69.0       30.50      0.139    0.084    0.268    0.297
1982.07    62.0       31.00      0.092    0.091    0.297    0.288
2470.86    65.0       33.50      0.113    0.125    0.393    0.373
1203.15    60.1       26.25      0.079    0.019    0.080    0.117
2843.84    71.0       36.00      0.151    0.156    0.464    0.469
2636.25    66.0       36.00      0.120    0.156    0.421    0.443
1998.39    65.0       32.00      0.113    0.105    0.301    0.330
2005.03    64.0       32.00      0.106    0.105    0.302    0.323
2568.76    66.0       35.00      0.120    0.144    0.410    0.418
2161.18    65.0       32.50      0.113    0.112    0.335    0.345
2112.20    67.0       32.00      0.126    0.105    0.325    0.333
3009.33    65.0       38.00      0.113    0.180    0.478    0.438
1992.24    63.0       31.00      0.099    0.091    0.299    0.288     0.011
2746.98    70.0       34.00      0.145    0.131    0.439    0.407     0.032
2238.27    64.0       35.00      0.106    0.144    0.350    0.406    -0.066
1747.47    67.0       30.00      0.126    0.077    0.242    0.280    -0.038
2863.91    67.0       36.00      0.126    0.156    0.457    0.448
3593.47    72.0       39.00      0.157    0.191    0.555    0.555     0
2435.48    62.0       35.00      0.092    0.144    0.387    0.443    -0.056
2430.18    63.0       34.00      0.099    0.131    0.386    0.362     0.024
2590.07    67.0       36.00      0.126    0.144    0.413    0.423
3577.68    70.0       41.00      0.145    0.213    0.554    0.596
3299.24    73.0       40.00      0.163    0.202    0.518    0.598
1986.14    64.0       32.50      0.106    0.112    0.298    0.338
3109.04    68.0       38.00      0.133    0.180    0.493    0.508
2821.56    71.0       37.00      0.151    0.168    0.450    0.498    -0.048
2932.24    67.0       38.00      0.126    0.180    0.467    0.501
3304.63    69.0       38.00      0.139    0.180    0.519    0.514
2565.46    72.0       36.00      0.157    0.144    0.409    0.450    -0.041
4509.93    74.0       41.33      0.169    0.216    0.654    0.627
4804.01    81.0       42.00      0.208    0.223    0.682    0.683
4241.80    75.0       40.75      0.175    0.210    0.627    0.619
4516.10    69.2       43.25      0.140    0.236    0.655    0.643     0.012
5011.62    77.5       43.10      0.189    0.234    0.700    0.691
2110.73    65.0       31.50      0.113    0.098    0.324    0.316     0.008
2775.70    76.0       34.60      0.181    0.139    0.443    0.448
3927.90    72.0       39.00      0.157    0.191    0.594    0.555
4212.77    80.0       41.50      0.203    0.218    0.624    0.663
3562.64    78.5       38.50      0.195    0.185    0.552    0.575    -0.023

* X2 = log10 (circumference) - 1.700, stated to three decimal places.
  X3 = log10 (“over”) - 1.4, stated to three decimal places.
  X1 = log10 (volume) - 3.0, stated to three decimal places.




TABLE 79 (Continued)

Volume,    Circum-    “Over,”
in cubic   ference,   in feet    X2*      X3*      X1*      X1′      z
feet       in feet

2853.96    75.0       35.50      0.175    0.150    0.455    0.461    -0.006
3294.38    69.0       38.00      0.139    0.180    0.518    0.514     0.004
1689.54    63.0       30.50      0.099    0.084    0.228    0.274    -0.046
2228.84    62.0       33.00      0.092    0.119    0.348    0.341     0.007
2362.61    64.0       34.00      0.106    0.131    0.373    0.379    -0.006
3088.28    68.0       38.50      0.133    0.185    0.490    0.520    -0.030
3820.79    70.0       40.00      0.145    0.202    0.582    0.570     0.012
3126.64    63.0       36.90      0.099    0.167    0.495    0.447     0.048
3624.76    71.0       38.46      0.151    0.185    0.559    0.536     0.023
3023.97    73.0       36.50      0.163    0.162    0.480    0.493    -0.013
6045.42    79.0       47.00      0.198    0.272    0.781    0.798    -0.017
3100.11    64.0       37.00      0.106    0.168    0.491    0.457     0.034
3378.07    70.0       38.00      0.145    0.180    0.529    0.519     0.010
3040.29    77.0       35.00      0.186    0.144    0.483    0.464     0.019
2252.16    65.0       32.50      0.113    0.112    0.353    0.345     0.008
3552.61    76.0       37.00      0.181    0.168    0.551    0.481     0.070
2635.90    66.0       34.50      0.120    0.138    0.421    0.405     0.016
3201.41    71.0       35.50      0.151    0.150    0.505    0.455     0.050
2590.21    69.0       35.00      0.139    0.144    0.413    0.432    -0.019
3743.55    76.0       38.25      0.181    0.183    0.573    0.558     0.015
3858.03    73.0       39.50      0.163    0.197    0.586    0.576     0.010
3829.44    74.0       39.75      0.169    0.199    0.583    0.586    -0.003
2556.44    66.0       33.00      0.120    0.119    0.408    0.365     0.043
3119.07    69.0       36.00      0.139    0.156    0.494    0.460     0.034
2122.38    65.5       32.00      0.116    0.105    0.327    0.332    -0.005
2921.92    69.0       36.00      0.139    0.156    0.466    0.460     0.006
2936.35    72.5       34.50      0.160    0.138    0.468    0.430     0.038
2427.66    76.0       33.00      0.181    0.119    0.385    0.399    -0.014
2069.38    65.0       31.50      0.113    0.098    0.316    0.315     0.001
1809.54    72.0       30.00      0.157    0.077    0.279    0.285    -0.006
4289.28    78.5       40.50      0.195    0.207    0.632    0.629     0.003
2407.39    67.5       32.50      0.129    0.112    0.381    0.358     0.023
3097.99    66.0       35.50      0.120    0.150    0.491    0.430     0.061
3893.67    75.5       39.25      0.178    0.194    0.590    0.582     0.008
2238.66    68.0       31.75      0.133    0.102    0.350    0.336     0.014
2314.79    64.0       33.10      0.106    0.120    0.364    0.356     0.008
2667.07    66.0       34.70      0.120    0.140    0.426    0.409     0.017
2582.07    68.0       33.50      0.133    0.125    0.412    0.388     0.024
3126.50    75.0       37.00      0.175    0.168    0.535    0.516     0.019
2307.34    60.0       33.40      0.078    0.124    0.363    0.336     0.027
3960.41    76.0       39.30      0.181    0.194    0.598    0.585     0.013

* X2 = log10 (circumference) - 1.700, stated to three decimal places.
  X3 = log10 (“over”) - 1.4, stated to three decimal places.
  X1 = log10 (volume) - 3.0, stated to three decimal places.





The logarithms (to base 10) are accordingly also shown in Table 79,
and are designated as X2, X3, and X1. (To facilitate the subsequent
computations, 1.7 has been subtracted from the logarithm for
circumference, 1.4 from the logarithm for “over,” and 3.0 from the logarithm
for volume.)
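The coding of the logarithms can be reproduced directly. The short sketch below applies the three subtractions to the measurements of the first two stacks listed in Table 79; the function name is hypothetical.

```python
import math

def coded_logs(volume, circumference, over):
    """Return (X1, X2, X3): base-10 logarithms less 3.0, 1.7, and 1.4,
    stated to three decimal places."""
    x1 = round(math.log10(volume) - 3.0, 3)
    x2 = round(math.log10(circumference) - 1.7, 3)
    x3 = round(math.log10(over) - 1.4, 3)
    return x1, x2, x3

# First stack in Table 79: 2853.00 cubic feet, 69.0 feet around, 37.00 feet over.
print(coded_logs(2853.00, 69.0, 37.00))  # (0.455, 0.139, 0.168)

# Second stack: 2702.00 cubic feet, 65.0 feet around, 36.50 feet over.
print(coded_logs(2702.00, 65.0, 36.50))
```

Subtracting a constant from each logarithm only shifts the origin; it leaves the differences between stacks, and hence the relation being studied, unchanged.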

Subgrouping and averaging the observations. The first step in the 
process of determining the joint functional relation is to classify the 
observations according to X2, and subclassify according to X3, and
determine the averages of X1, X2, and X3 for each group. Since there are
only 120 observations, it would not be worth while to make too many
groups. Four groups each way would give 16 subgroups, and 5 each 
way would give 25. If the cases were uniformly distributed through 25 
subgroups, that would make less than 5 cases to a group, which is 
rather thin for a satisfactory average (though it might be sufficient in 
this particular problem, where the correlation is much higher than in 
many problems which must be dealt with.) However, the cases will 
not necessarily be distributed uniformly through all the groups, so it 
will be best if we try the fivefold classification and see how the cases 
fall. 
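The subgrouping and averaging may be sketched as follows. Only the first six stacks of Table 79 are used, so the counts are illustrative; the class limits are those of the fivefold classification, and the variable and function names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

# Lower limits of the second through fifth classes for each variable.
X2_EDGES = [0.090, 0.120, 0.150, 0.180]
X3_EDGES = [0.100, 0.140, 0.180, 0.220]

# (X2, X3, X1) for the first six stacks in Table 79.
observations = [
    (0.139, 0.168, 0.455),
    (0.113, 0.162, 0.432),
    (0.163, 0.185, 0.491),
    (0.096, 0.023, 0.116),
    (0.145, 0.144, 0.361),
    (0.133, 0.162, 0.435),
]

def subgroup_averages(rows):
    """Classify by X2, subclassify by X3, and average each subgroup."""
    groups = defaultdict(list)
    for x2, x3, x1 in rows:
        cell = (bisect_right(X2_EDGES, x2), bisect_right(X3_EDGES, x3))
        groups[cell].append((x2, x3, x1))
    summary = {}
    for cell, members in groups.items():
        n = len(members)
        summary[cell] = {
            "n": n,
            "mean_x2": sum(m[0] for m in members) / n,
            "mean_x3": sum(m[1] for m in members) / n,
            "mean_x1": sum(m[2] for m in members) / n,
        }
    return summary

cells = subgroup_averages(observations)
for cell in sorted(cells):
    s = cells[cell]
    print(cell, s["n"], round(s["mean_x2"], 3), round(s["mean_x3"], 3), round(s["mean_x1"], 3))
```

Each cell is labeled by a pair of class numbers counted from 0, so (2, 2) is the subgroup with X2 from 0.120 to 0.149 and X3 from 0.140 to 0.179. Cells with no observations simply do not appear, which corresponds to the vacant cells of the tabulation.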

TABLE 80

Number of Haystack Observations, Classified According to X2 and X3
(Logarithms of Circumference and “Over”)

                                    X2 values
X3 values         Under     0.090-    0.120-    0.150-    0.180
                  0.090     0.119     0.149     0.179     and over

Under 0.100         2         7         3         1
0.100-0.139         1        14        10         3         2
0.140-0.179         1         8        17         8         2
0.180-0.219                   2        10        14         7
0.220 and over                          1         2         5


There is a marked correlation between ^2 and so a few groups 
have 10 or more reports, whereas 15 out of the 25 have under 5. Pre- 
liminary examination of the data indicates that a unit change in X3 is 
generally accompanied by a larger change in than is a unit change 
in X 2 . Accordingly we may decide to halve the groups in the central 
portion of the range of X3, making the class intervals witli respect to 
that variable under 0.100, 0.100 — 0.119, 0.120 — 0.139, 0.140 — 0.159, 




382 


JOINT FUNCTIONAL REGRESSION 


0.160-0.179, 0.180-0.199, 0.200-0.219, and 0.220 and over. With
5 classes for X2, this will give a 40-group classification, but with many
of the “cells” vacant. Averaging X2, X3, and X1 for each of the
resulting groups gives means as shown in Table 81.
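
The classification and averaging just described can be sketched in a few lines. This is only an illustration, not the worksheet procedure of the text: the function name `subgroup_means` and the four sample observations are invented for the example, since the individual records are not reproduced in this section. Only the class limits are taken from the text.

```python
from bisect import bisect_right
from collections import defaultdict

# Class limits from the text: five classes for X2, eight for X3.
X2_EDGES = [0.090, 0.120, 0.150, 0.180]
X3_EDGES = [0.100, 0.120, 0.140, 0.160, 0.180, 0.200, 0.220]


def subgroup_means(observations):
    """Return {(x2_class, x3_class): (n, mean_x2, mean_x3, mean_x1)}.

    `observations` is an iterable of (x1, x2, x3) tuples. Class 0 is the
    lowest class ("under ..."); the highest index is "... and over".
    """
    sums = defaultdict(lambda: [0, 0.0, 0.0, 0.0])
    for x1, x2, x3 in observations:
        cell = (bisect_right(X2_EDGES, x2), bisect_right(X3_EDGES, x3))
        total = sums[cell]
        total[0] += 1
        total[1] += x2
        total[2] += x3
        total[3] += x1
    return {
        cell: (n, sx2 / n, sx3 / n, sx1 / n)
        for cell, (n, sx2, sx3, sx1) in sums.items()
    }


# Hypothetical observations (x1, x2, x3), invented for illustration only.
sample = [
    (0.18, 0.070, 0.050),
    (0.19, 0.072, 0.060),
    (0.36, 0.078, 0.124),
    (0.33, 0.107, 0.112),
]
means = subgroup_means(sample)
for cell in sorted(means):
    print(cell, means[cell])
```

Cells that receive no observation simply do not appear in the result, which corresponds to the vacant cells of the 40-group classification.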

Plotting the subgroup averages and drawing first approximation
curves. Inspection of the averages of X2 down each column in
Table 81 shows that most of the variation in that factor has been
eliminated, except in the upper subgroups of X3, above X3 = 0.200, where
the averages tend to fall above the mean of the range. There is a more
marked tendency for the averages of X3 to rise across the rows from left



Fig. 65. Differences in X1, with differences in X3 for specified values of X2, and
first approximate curves.


to right. Accordingly the groups classified with respect to X2 will be
studied first, to determine the changes in X1 with changes in X3, X2
being held (approximately) constant at various values. This may be
done by plotting separately the average difference in X1 with differences
in X3, for each column. Figure 65 shows these averages, with the





TABLE 81

Haystack Data: Average X2, X3, and X1, for Observations Classified by
X2 and X3

X3 values          Number of    Mean X2    Mean X3    Mean X1
                   cases

                          X2 under 0.090
Under 0.100            2        0.071      0.055      0.185
0.100-0.119            -          -          -          -
0.120-0.139            1        0.078      0.124      0.363
0.140-0.159            -          -          -          -
0.160-0.179            1        0.086      0.162      0.394

                          X2 0.090-0.119
Under 0.100            7        0.102      0.081      0.278
0.100-0.119           10        0.107      0.112      0.332
0.120-0.139            4        0.106      0.127      0.379
0.140-0.159            2        0.099      0.144      0.369
0.160-0.179            6        0.105      0.167      0.473
0.180-0.199            2        0.103      0.183      0.459

                          X2 0.120-0.149
Under 0.100            3        0.130      0.085      0.278
0.100-0.119            6        0.132      0.108      0.362
0.120-0.139            4        0.130      0.131      0.417
0.140-0.159           12        0.129      0.149      0.440
0.160-0.179            5        0.135      0.171      0.483
0.180-0.199            8        0.135      0.181      0.515
0.200-0.219            2        0.145      0.208      0.568
0.220 and over         1        0.140      0.236      0.655

                          X2 0.150-0.179
Under 0.100            1        0.157      0.077      0.279
0.100-0.119            -          -          -          -
0.120-0.139            3        0.161      0.138      0.446
0.140-0.159            4        0.159      0.150      0.456
0.160-0.179            4        0.163      0.167      0.492
0.180-0.199            9        0.161      0.193      0.558
0.200-0.219            5        0.167      0.208      0.606
0.220 and over         2        0.1S7      0.252      0.719

                          X2 0.180 and over
0.100-0.119            1        0.181      0.119      0.385
0.120-0.139            1        0.181      0.139      0.443
0.140-0.159            1        0.186      0.144      0.483
0.160-0.179            1        0.181      0.168      0.551
0.180-0.199            3        0.186      0.187      0.674
0.200-0.219            4        0.198      0.210      0.638
0.220 and over         5        0.199      0.242      0.724






number of observations represented by each one indicated. Apparently 
the relation, for each group of averages, tends to be linear, so straight 
lines are drawn in by eye as first approximations to the final relation. 
It should be noticed, however, that these lines are not all of the same 
slope but tend to slope more steeply as X2 increases. In some problems
curves instead of straight lines would be indicated by these group aver- 
ages. In such a case, separate curves would be fitted freehand to each 
set of averages. In drawing such curves it is desirable to keep them as 
nearly of the same shape as the data will permit, and to change the 
shape only gradually from one to the next. 
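
As a rough numerical counterpart to the lines drawn by eye, a straight line can be fitted to each column of subgroup averages, weighting each average by its number of cases. Weighted least squares is an assumption made here for illustration; the text itself draws the lines freehand. The function name `weighted_line` is invented, and the figures are the subgroup averages of three of the X2 classes in Table 81.

```python
def weighted_line(points):
    """Fit x1 = c + s * x3 to (n, mean_x3, mean_x1) rows, weighting by n."""
    total = sum(n for n, _, _ in points)
    mean_x = sum(n * x for n, x, _ in points) / total
    mean_y = sum(n * y for n, _, y in points) / total
    sxx = sum(n * (x - mean_x) ** 2 for n, x, _ in points)
    sxy = sum(n * (x - mean_x) * (y - mean_y) for n, x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# (number of cases, mean X3, mean X1) from Table 81, by X2 class.
COLUMNS = {
    "0.090-0.119": [
        (7, 0.081, 0.278), (10, 0.112, 0.332), (4, 0.127, 0.379),
        (2, 0.144, 0.369), (6, 0.167, 0.473), (2, 0.183, 0.459),
    ],
    "0.120-0.149": [
        (3, 0.085, 0.278), (6, 0.108, 0.362), (4, 0.131, 0.417),
        (12, 0.149, 0.440), (5, 0.171, 0.483), (8, 0.181, 0.515),
        (2, 0.208, 0.568), (1, 0.236, 0.655),
    ],
    "0.180 and over": [
        (1, 0.119, 0.385), (1, 0.139, 0.443), (1, 0.144, 0.483),
        (1, 0.168, 0.551), (3, 0.187, 0.674), (4, 0.210, 0.638),
        (5, 0.242, 0.724),
    ],
}

lines = {label: weighted_line(rows) for label, rows in COLUMNS.items()}
for label, (intercept, slope) in lines.items():
    print(f"X2 {label}: X1 = {intercept:.3f} + {slope:.2f} X3")
```

The fitted slopes rise from roughly 2.1 in the lower X2 class to roughly 2.5 in the highest, which agrees with the observation that the lines slope more steeply as X2 increases.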

Obtaining a second approximation to the joint surface. After the
first approximation lines or curves have been drawn along the X3 axis,
the next step is to smooth along the X2 axis. To do this the values of
X1 according to the first approximation curves are read off at intervals
on X3, corresponding to the central values of the groups of X3 in
Table 81. These values are shown in Table 82.

TABLE 82

Estimated Values of X1 for Specified Values of X2 and X3, from First
Approximation Lines

                          X1, for X2 values of
            Under      0.090-     0.120-     0.150-     0.180
X3          0.090      0.119      0.149      0.179      and over

0.060       0.195      0.233      0.247        -          -
0.110       0.307      0.334      0.359      0.354      0.382
0.130       0.352      0.374      0.403        -        0.433
0.150       0.397      0.415      0.448      0.454      0.484
0.170       0.442      0.455      0.492      0.505      0.535
0.190         -        0.496      0.537      0.555      0.586
0.210         -          -        0.583      0.605      0.637
0.240         -          -        0.650      0.681      0.713


The readings shown in Table 82 may next be smoothed along the X2
axis by plotting the estimated value of X1 for the specified values of X3,
with varying values of X2. This process is shown in Figure 66. The
values used as the coordinates for X2 are the corresponding average
values of X2 for each subgroup in Table 81. Thus in plotting the values
of X1 for X3 = 0.060 (the first line in the table above) the values for
X2 are the averages from the first lines in Table 81 (0.071, 0.102,
0.130, etc.). By using these actual averages allowance is made for
cases such as those noted previously, where not even the process of











subgrouping has completely removed the influence of the other inde- 
pendent variable. 

The averages plotted in Figure 66 show a slight but consistent
curvilinear relation of X1 to X2, with a gradually increasing slope for the
higher values. In two cases straight lines would fit as well as curves,
but, since all the remaining groups show consistent slight curves, they
are drawn in here as well. The averages, smoothed freehand along the
X2 axis, give a second approximation to the joint functional relation.

Making the final smoothing of the approximation curve. As a final 
check, values from the curves in Figure 66 may be read off for stated 



Fig. 66. Differences in X1 with differences in X2 for specified values of X3, read
from smoothed curves in Fig. 65.

values of X2, and smoothed again with reference to X3. Since the
variation in the averages of X2 in each column of Table 81, and in the
averages of X3 in each row, has now been allowed for by the methods
used in constructing Figures 65 and 66, the new readings may be taken
for any convenient interval on X2. The values 0.070, 0.080, 0.100,
0.120, 0.140, 0.160, 0.180, and 0.200 may be used, giving convenient
values for subsequent interpolations. Reading off the corresponding
X1 values from the curves in Figure 66, just as shown before in Table 82,
and plotting with the X3 values as abscissas, gives the results shown in
Figure 67.






After the readings from Figure 66 are plotted, as indicated by
the hollow circles in Figure 67, they are smoothed with reference to the
X3 axis. It is found again that straight lines serve to describe the
relation, and these are accordingly drawn in by eye, with some
consideration of adjacent lines where otherwise the line would be out of
agreement, as for X2 = 0.200. These lines, showing the estimated
values of X1 for specified differences in X2 and X3, may be taken as
defining the functional relation between the three variables. This
figure indicates very clearly the “warping” of the regression surface.
The increase in X1 per unit increase in X3 is much greater for large
values of X2 than for small values. That fact could not have been
expressed in any regression equation of the form X1 = f2(X2) +
f3(X3). The shape of the “correlation surface” may be seen in Figure



Fig. 67. Differences in X1 with differences in X3, from second smoothed curves.


68, where the final lines from Figure 67 have been combined into a 
three-dimensional diagram. 
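
The point about warping can be put in symbols. What follows is a sketch in the chapter's notation, with h standing for an arbitrary increment in X3 (a symbol introduced here for illustration only). Under the additive form X1 = f2(X2) + f3(X3), the change in X1 for a given change in X3 is

```latex
\bigl[f_2(X_2) + f_3(X_3 + h)\bigr] - \bigl[f_2(X_2) + f_3(X_3)\bigr]
  = f_3(X_3 + h) - f_3(X_3),
```

which does not involve X2 at all. Lines drawn for different values of X2 could then differ only in level, never in slope. The first approximation readings of Table 82 already contradict this: between X3 = 0.150 and X3 = 0.170 the reading rises from 0.415 to 0.455, or 0.040, for X2 of 0.090-0.119, but from 0.484 to 0.535, or 0.051, for X2 of 0.180 and over.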

Estimating X1 from the joint function. Estimates of X1 for any
combination of values of X2 and X3 may be made directly from Figures
67 or 68 by making the necessary interpolations. The process may
be more conveniently carried out by making a “contour chart,” which
shows the differences in X1, for different combinations of X2 and X3,




by a series of lines passing through combinations of the other two
variables which will produce equal values of X1. Thus if a series of
planes were passed through the cube in Figure 68, parallel to the base
plane, at X1 = 0.300, 0.400, 0.500, etc., they would cut the top surface
of the solid in the intersections indicated by the dotted lines. Then
if one were to look straight down upon the top of the solid, these dotted
lines would appear as shown in Figure 69. Other lines have been
drawn in between these to indicate 0.050 differences in X1.³
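
Locating a contour amounts to asking, for each X2 line, what value of X3 yields the chosen value of X1. A minimal sketch follows, assuming each line is straight and described by an intercept and a slope. The function name `contour_points` and the three lines are hypothetical; the coefficients are not readings from Figure 67.

```python
def contour_points(lines, level):
    """For each X2 line, return the X3 at which X1 equals `level`.

    `lines` maps an X2 value to the (intercept, slope) of the straight
    line X1 = intercept + slope * X3 drawn for that X2.
    """
    return [(x2, (level - c) / s) for x2, (c, s) in sorted(lines.items())]


# Hypothetical straight lines, NOT read from Figure 67.
lines = {
    0.100: (0.10, 2.0),
    0.140: (0.11, 2.2),
    0.180: (0.12, 2.5),
}

points = contour_points(lines, 0.400)
for x2, x3 in points:
    print(f"X2 = {x2:.3f}, X3 = {x3:.3f}")
```

Joining the resulting (X2, X3) pairs gives the contour for that level; repeating the call for 0.300, 0.500, and so on gives the remaining contours.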



Fig. 68. Probable value of X1 for specific combinations of X2 and X3, from
smoothed surface.


Determining the standard error of estimate and index of multiple
correlation. Estimated values of X1 for each observation may now be
worked out by the use of Figure 67 or 69, interpolating for the distances
between the adjacent lines where an observation falls off the line. Table
79 also shows these estimated values (X1''), and the difference between
the actual and the estimated. The standard deviation of the original


³ Figure 69 may be most readily drawn from Figure 67. By noting the value of
X3 necessary to produce a value of X1 = 0.300, for each X2 line, a series of values
for X2 and X3 may be located, through which the contour for X1 = 0.300 is drawn.
The values for X1 = 0.400 are then noted, giving the location of the 0.400
contour, and so on.




values of X1 is 0.1265, whereas the standard deviation of the residuals
computed in Table 79 is 0.0295. Apparently the regression surface
accounts for almost all the variation in volume. The accuracy with
which estimates of X1 may be made from X2 and X3 may be
determined by adjusting the standard deviation of the residuals in the
usual way.

Fig. 69. Probable value of X1 for specific combinations of X2 and X3, shown by
contours.

For the type of surface shown, the relations might be quite closely
represented by an equation of the type ⁴

$$X_1 = a + b_2 X_2 + (b_3 + b_4 X_2) X_3$$

This equation expresses the relation shown in Figure 68: for a constant
value of X2, the regression of X1 on X3 is linear; but as X2 changes,
this regression also changes at a uniform rate. The equation given has

⁴ See page 408 for a further discussion of the possibilities of this type of formulation.





4 constants, so in adjusting to determine the standard error of
estimate, m = 4. Hence

$$\bar{S}_{1.f(2,3)} = 0.0300$$

Similarly, the index of multiple correlation for X1 as a joint
function of X2 and X3 may be computed in the usual manner:

$$P^2 = 1 - \frac{(0.0295)^2}{(0.1265)^2} = 1 - \frac{0.000870}{0.01600} = 0.9456,$$

and adjusting for the number of constants,

$$\bar{P}^2 = 1 - (1 - P^2)\left(\frac{n - 1}{n - m}\right) = 1 - (1 - 0.9456)\left(\frac{119}{116}\right) = 0.9442$$

$$\bar{P} = 0.972$$
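
The arithmetic of these adjustments can be checked directly. The sketch below assumes that the standard error of estimate is obtained by multiplying the residual variance by n/(n - m), which is the form that reproduces the 0.0300 quoted in the text; the variable names are chosen here for illustration.

```python
from math import sqrt

n = 120            # number of observations
m = 4              # constants in the equation a, b2, b3, b4
sd_x1 = 0.1265     # standard deviation of the original X1 values
sd_resid = 0.0295  # standard deviation of the residuals

# Standard error of estimate, adjusted for the constants fitted
# (assumed form: residual variance times n / (n - m)).
se_estimate = sqrt(sd_resid ** 2 * n / (n - m))

# Index of multiple determination, unadjusted and adjusted.
p2 = 1 - sd_resid ** 2 / sd_x1 ** 2
p2_adjusted = 1 - (1 - p2) * (n - 1) / (n - m)
p_adjusted = sqrt(p2_adjusted)

print(round(se_estimate, 4), round(p2, 4),
      round(p2_adjusted, 4), round(p_adjusted, 3))
```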


It is evident that the volume of a round haystack may be very
closely estimated from the rough farm measurements of circumference
and “over.” The standard error of estimate, 0.0300, indicates that the
logarithm of volume can be estimated to ± 0.0300 of the true logarithm
for two-thirds of the observations and to ± 0.0600 of the true logarithm
for 95 per cent of them. Taking the antilogarithms of 0.0300 and
1̄.9700 (that is, -0.0300), and of 0.0600 and 1̄.9400 (that is, -0.0600),
we find that this means the volume
can be estimated to between 107.2 per cent and 93.3 per cent, or
between 114.9 per cent and 87.1 per cent, of the true values,
respectively, for the proportions stated.

Stating the conclusions shown by the joint function. After the 
joint relation of one variable to two others has been determined by the 
method sketched, the final regression surface, as expressed in Figures 
67, 68, or 69, may be restated in simpler terms by preparing tables 





showing the average or expected values of X1 for stated combinations
of X2 and X3. In this particular problem, where the surface was
determined with respect to logarithmic values, that involves determining
the logarithms of X2 and X3 for the selected values, reading off from
the charts the corresponding estimated value for the logarithm of X1,
and finding its antilogarithm. Carrying out this process, we obtain the
values shown in Table 83.

TABLE 83

Average Volume of Round Haystacks for Different Combinations
of Circumference and “Over”

                              “Over,” in feet
Circumference      30         34         38         42         46

                Cubic feet Cubic feet Cubic feet Cubic feet Cubic feet
60 feet           1,730      2,244        -          -          -
65 feet           1,871      2,432      3,097        -          -
70 feet           1,928      2,553      3,319      4,150        -
75 feet             -        2,655      3,524      4,467      5,623
80 feet             -          -          -        4,710      6,026


Determining joint influence of two independent variables, holding
other independent variables constant. In many cases it may be
desirable to allow for the joint influence of two variables while
simultaneously eliminating or holding constant the effect of one or more
additional independent variables. In the corn problem it would be
desirable to determine the relation of yield to rainfall and temperature
jointly, while simultaneously allowing for the upward tendency in
yield during the period studied. This may be done by determining
the relation according to the equation

$$X_1 = f_{2.3}(X_2, X_3) + f_4(X_4) \qquad (85)$$

This relation may be worked out by combining the method just
shown for determining a joint function for two independent variables
with the method of successive approximation for handling many
variables, as discussed in Chapters 14 and 16. The essential steps are (1)
to determine the curvilinear changes in X1 with changes in X2, X3, and
X4, according to the simpler equation,

$$X_1 = a + f_2(X_2) + f_3(X_3) + f_4(X_4)$$





and then (2) to compute the residuals for each observation, using 
these curves, and subclassify the residuals according to the two variables 
for which the presence of a joint function is to be tested. If these 
averages of residuals indicate any significant warping of the surface, 
(3) they are next smoothed by the method presented following Table
81. The residuals may then (4) be adjusted to take account of this
joint relation in addition to the individual curvilinear relations pre- 
viously allowed for, and their standard deviation computed. If the 
variance has been significantly reduced, the residuals may then (5) 
be averaged with respect to the remaining independent factor, to see 
if the curve for that factor will be changed now that the joint relation 
to the other factors has been allowed for. If it is changed, the residuals 
are recomputed to see if any further change need be made in the 
joint function and the process continued until the final shape of the 
curve and joint surface is determined. 
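
Steps (1) to (3) can be illustrated numerically. The sketch below is not the graphic procedure of the text: it substitutes straight-line least squares for the freehand curves of step (1), and it uses synthetic data, generated with a built-in product term, so that the warping is known in advance. The helper names `solve` and `fit` are invented for the example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(rows, y):
    """Least-squares coefficients of y on the columns of `rows`."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic data: a balanced grid of X2, X3, X4 with a product term in X1.
data = [(x2, x3, x4) for x2 in range(4) for x3 in range(4) for x4 in range(2)]
x1 = [1 + x2 + x3 + 0.5 * x2 * x3 + 0.3 * x4 for x2, x3, x4 in data]

# Step (1): additive relation (straight lines stand in for the curves).
design = [(1.0, x2, x3, x4) for x2, x3, x4 in data]
coef = fit(design, x1)

# Step (2): residuals, subclassified by the two variables under test.
resid = [y - sum(c * v for c, v in zip(coef, row)) for y, row in zip(x1, design)]
cells = {}
for (x2, x3, _), z in zip(data, resid):
    cells.setdefault((x2 >= 2, x3 >= 2), []).append(z)

# Step (3) begins from the average residual in each cell.
cell_means = {key: sum(v) / len(v) for key, v in cells.items()}
for key in sorted(cell_means):
    print(key, round(cell_means[key], 3))
```

Positive average residuals where both variables are low or both are high, and negative ones elsewhere, are the signature of a warped surface that the additive equation cannot follow.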

Measuring correlation with respect to joint functions. The
correlation may be measured with respect to joint functions just as before
it was measured with respect to curvilinear regressions. The standard
error of the residuals, adjusted for the estimated number of constants,
indicates the standard error of estimate; and this adjusted standard
error, substituted in equation (66.2), gives the index of multiple
correlation and of multiple determination. But since the combined influence of
X2 and X3 is being determined, it is not possible to compute coefficients
of partial correlation, or other measures of individual importance, for
the variables which are being considered jointly. It would be possible
to work out what portion of the variance in X1 was accounted for by
X2 and X3 together and how much by X4, but that would be all.⁵ It is
something of a guess how many constants should be allowed for in
computing the correlation and the standard error. It will be higher than
for the individual curves for f2(X2) and f3(X3), in general. If the joint
relation merely involves a gradual regular shifting of the slope of the
curves across the surface, one additional constant would be enough; if it
involves a shifting at an increasing rate, two might be assumed; and if
it involves several changes in shift, even more might be needed.

Determining joint influence of three or more independent variables. 
The methods just described might be used to determine several joint 

⁵ This would involve determining, by least squares, the equation

$$X_1 = a + b_2[f_{2.3}(X_2, X_3)] + b_4[f_4(X_4)]$$

The separate determination with respect to b2 would then indicate the
determination by X2 and X3 combined. See Note 11, Appendix 2.





relations at the same time, each relation involving two independent
variables. Thus if X2 = rain in July, X3 = temperature in July,
X4 = rain in August, X5 = temperature in August, and X6 = time,
the yield of corn might be explained by a set of relations represented by
the equation

$$X_1 = f_{2.3}(X_2, X_3) + f_{4.5}(X_4, X_5) + f_6(X_6) \qquad (86)$$


The functions would be determined by first getting the net regression
curves for each factor separately, then the joint curves for f2.3(X2,
X3) and f4.5(X4, X5) by classifying the residuals by the method just
described, and then determining the final shapes by successive
approximations. But there will be some cases where even so flexible a relation
as represented in equation (86) will not be sufficient really to represent
the relations. For example, yield might depend jointly on rainfall,
temperature, and length of growing season, and a change in any one
factor might cause differences in the effects of others as well. Such a
relation would be represented by such equations as

$$X_1 = f(X_2, X_3, X_4, \ldots, X_n) \qquad (87)$$

To determine the shape of such a function for even three inde- 
pendent variables would require a large number of observations, since 
a threefold subclassification would be needed. If only 4 classes were 
used for each variable, 64 subclasses would be possible. Not unless 
there were sufficient observations so that say 3 to 5 might fall in each 
class, on the average, could such a relation be determined with any 
degree of accuracy, unless the correlation was very high indeed. If the 
joint correlation were perfect, one case to a subclass would be sufficient 
to indicate the nature of the function. 
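
The growth in the number of subclasses, and hence in the observations required, is simple arithmetic. The helper below, whose name `observations_needed` is invented here, merely restates the figures of the preceding paragraph.

```python
def observations_needed(classes_per_variable, variables, per_cell):
    """Return (number of subclasses, observations for the given average)."""
    cells = classes_per_variable ** variables
    return cells, cells * per_cell


cells, low = observations_needed(4, 3, 3)   # 3 cases per subclass
_, high = observations_needed(4, 3, 5)      # 5 cases per subclass
print(cells, low, high)
```

With 4 classes for each of three independent variables there are 64 subclasses, so an average of 3 to 5 cases in each implies something like 192 to 320 observations.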

With three independent variables, successive smoothing in three 
dimensions would be involved. Where an adequate number of obser- 
vations was available, the process might be simplified by dividing the 
observations into several groups according to one variable, determining
the functional relation to the other two independent variables
separately for each group, and then smoothing the results for the 
different groups together to determine the change in joint function 
with changes in the first variable. 

Figure 70 illustrates some results of this sort, for a four-dimen- 
sional joint function. These results were obtained from an analysis 
of 190 observations of sales of individual lots of apples. The records 
were first separated into those for each of the 5 sizes of apples, and 
the joint functional relation of price to amount of insect injury and 




amount of scab determined separately for each size. These results
were then smoothed between apples of different sizes, to make the
“surface” of the imaginary four-dimensional solid diagram show a
gradual continuous change over every dimension.⁶


Fig. 70. Average price of apples of given sizes, for various combinations of amount
of insect injury and amount of scab. (One panel for each size of apple.)





While, of course, it is not possible to draw a single diagram
expressing the four-dimensional relationship

$$X_1 = f(X_2, X_3, X_4)$$

⁶ This is done by reading expected prices for 0 scab, 0 insect injury, for apples
of each size, and smoothing that series; reading for 0 scab, 20 per cent insect injury,
and smoothing that series; and so on until every portion of the surface has been
smoothed with respect to the third independent variable. The smoothed values
could then be read off and smoothed again in other dimensions, until the final
continuous function was obtained. This illustration is from an analysis supplied
by Frederick V. Waugh. For a more elaborate study of the same type, see John R.
Raeburn, Joint correlation applied to the quality and price of McIntosh apples,
Cornell University Agricultural Experiment Station Memoir 220. March, 1939.






the relation may be visualized by a composite diagram, as illustrated 
in Figure 70. This figure in particular illustrates the significant 
relations brought out by the joint functional treatment. Thus it is 
seen that large apples, with neither scab nor insect injury, sold for 
a material premium over perfect apples of small size; but that if the 
apples were badly damaged it did not make much difference what size 
they were. This may be stated another way: the presence of defects
reduced the price of large apples much more than the price of small
ones. The figure also shows that the presence of either defect alone
reduced the value of apples of any size materially, whereas the presence
of both defects together reduced the price only slightly more. Thus
for 3-inch apples, apples with 0 scab and 0 insect injury sold for
$1.05; those with 0 scab but 100 per cent insect injury, for $0.66; those
with 100 per cent scab and 0 insect injury, for $0.65; and those with
100 per cent scab and 100 per cent insect injury for $0.42. Increasing
the insect injury from 0 to 100 per cent reduced the price 39 cents
for apples with no scab, and only 23 cents for those with 100 per cent
scab. Likewise for apples with neither scab nor insect injury 3-inch
apples sold for $1.05, and 2¼-inch ones for 75 cents; whereas for apples
of these two sizes with 100 per cent of both injuries, the prices were 42 
cents and 39 cents, respectively. These comparisons show what a 
difference the recognition of joint relations may make in research con- 
clusions, and how important may be the resulting differences in the 
statement of relations. 
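
The comparison for 3-inch apples can be laid out as a two-way table and the joint effect separated from what an additive relation would have predicted. The prices are those quoted above, expressed in cents; the variable names and the "additive prediction" are introduced here for illustration only.

```python
# Prices in cents for 3-inch apples, keyed by
# (per cent scab, per cent insect injury), as quoted in the text.
price = {
    (0, 0): 105,
    (0, 100): 66,
    (100, 0): 65,
    (100, 100): 42,
}

# Effect of insect injury at each level of scab.
insect_effect_no_scab = price[(0, 0)] - price[(0, 100)]
insect_effect_full_scab = price[(100, 0)] - price[(100, 100)]

# If the two defects acted additively, both reductions would be
# subtracted from the price of perfect apples.
additive_prediction = price[(0, 100)] + price[(100, 0)] - price[(0, 0)]
joint_departure = price[(100, 100)] - additive_prediction

print(insect_effect_no_scab, insect_effect_full_scab,
      additive_prediction, joint_departure)
```

The two insect-injury effects are the 39 cents and 23 cents stated in the text. Under an additive relation, apples with both defects would have been expected to bring 26 cents; the 42 cents actually realized is 16 cents higher, and that departure is what only a joint function can record.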

Theoretically there is no limit to the number of variables which 
could be considered jointly. The only practical limitation is the num- 
ber of observations available. Where it is possible to determine the 
joint relation, that affords by far the most satisfactory statement of 
the relationship, since then the real relation is not obscured by the 
assumptions hidden in the regression equation used. When a limited 
number of observations precludes a full recognition of joint relations, 
the use of mathematical transformations such as logarithms, logical 
grouping of the variables to determine the combinations of variables 
for which joint relations are most likely to obtain, and trial study of 
the residuals will serve to make the final regression equation set forth 
the true nature of the relations as closely as possible from the limited 
evidence available. 

Just as the accuracy of net regression curves depends largely on 
the number of observations along various portions of the curve and 
the standard error of estimate, so the reliability of a joint regression 
surface (such as that shown in Figure 68) would depend on the standard 





error of estimate and the number of cases falling within the selected 
portion of the area. Where the joint regression surface is determined 
mathematically, its reliability can be estimated by an extension of the 
same equations presented in Chapters 18 and 19. Methods of estimat- 
ing the standard errors of a surface determined graphically have not 
yet been developed. 

Summary. This chapter has developed means by which the rela- 
tion between one variable and two others operating jointly may be 
determined, either where no other variables are concerned or where 
one or more additional independent variables are taken into account. 
Methods are also discussed for measuring the influence of three or 
more independent variables operating jointly; but the increased num- 
ber of observations necessary for such determinations restricts the field 
of usefulness of this type of analysis. 

REFERENCES 

Ezekiel, Mordecai. The determination of curvilinear regression “surfaces” in the
presence of other variables, Jour. Amer. Stat. Assoc., Vol. XXI, pp. 310-320.
September, 1926.



CHAPTER 22 


SUPPLEMENTARY METHODS FOR DETERMINING 
CURVILINEAR AND JOINT RELATIONS 

Chapters 14, 16, and 21 have set forth means by which curvilinear
regressions may be determined for functions either of the simpler type

$$X_1 = f_2(X_2) + f_3(X_3) + f_4(X_4) + \cdots + f_n(X_n)$$

or of the more complex joint type

$$X_1 = f(X_2, X_3, \ldots, X_n)$$

In each case the methods were purely empirical and depended
on a combination of freehand smoothing with successive
approximation to the best curve as the influence of other factors was gradually
eliminated. In addition to the methods which have been presented,
there are other techniques which have been suggested for considering
even more complex relations. On the other hand, if a specific
mathematical function is assumed, the curves may be determined by a more
rigid process, using the principle of “least squares.” This chapter
presents some of these further methods, both for multiple curvilinear
regressions of the simpler forms and for joint functional relations.

Determining net regression curves by mathematical functions.
After the shape of the several net regression curves has been determined
by the successive approximation method, a definite mathematical
statement of the several functions may be obtained by an extension of the
method presented on pages 221 and 222. The freehand curves would
provide the basis for selecting functions which would fit the net shape
of the regression curves fairly well, giving at least this empirical
criterion as to what function to use. Applying this method to the egg
problem mentioned on page 302, for example, the final curves indicate
that a straight line is probably adequate to describe the net regression
of price on X3, that a cubic parabola would probably be required to
describe the net regression on X2, and that a second-degree parabola
might be sufficient to fit the net regression on X4. Accordingly, the
equation

$$X_1 = a + b_2 X_2 + b_2'(X_2^2) + b_2''(X_2^3) + b_3 X_3 + b_4 X_4 + b_4'(X_4^2)$$



might be fitted to the data. After the values for the seven constants 
were determined by the usual method for linear correlation, the close- 
ness with which the several mathematical curves fitted the net regres- 
sions could be judged by computing the residuals from the new regres- 
sion equation, and then plotting them as deviations from the several 
net regression curves, exactly as the residuals from the linear regres- 
sions were plotted in Figures 34, 35, and 36. If modifications in the 
fitted curves were found necessary, they could be determined by the 
approximation process again. 

Where the original relations indicate a marked curvilinear rela-
tion, as in Figures 33 and 41, the mathematical curves may be fitted 
right at the start, just as described above, and these curves used as 
the basis for subsequent corrections by the approximation method. 
Whether determining net linear regression, as illustrated in the corn- 
yield problem, or determining net curvilinear regressions, as just sug- 
gested, will prove the most expeditious way of beginning the suc- 
cessive approximation method will depend on the circumstances in 
individual problems. Thus if one regression is known to be markedly 
curvilinear, while the others are substantially linear, taking that 
curvilinearity into account in the equation may bring the linear re- 
gressions for all the other variables much closer to their final form, 
and so reduce the number of steps necessary in the successive approxi- 
mation determinations. 

Determining the curves by least squares. The process of deter- 
mining net regression curves by the use of a definite mathematical 
equation may be illustrated for the following data: 


X2    X3    X1          X2    X3    X1

 1     3     8           0     2     7
 2     2    10           4     5     9
 4     7     8           3     3    10
 9     8     9           1     2     9
 5     5    10           6     5    10
 2     3     9           1     2     9
 2     2     9           2     2    10
 7    14     7           4    14     6
 9     8     9           1     2     9
 2     4     8          10    11     8


Preliminary examination by the graphic method indicates that the
net regressions of X1 on both X2 and X3 may be approximately
represented by parabolas. Accordingly, a net regression curve may be
assumed of the type

$$X_1 = a + b_2 X_2 + b_2'(X_2^2) + b_3 X_3 + b_3'(X_3^2)$$

The arithmetic required to determine the five constants can be reduced
by “coding” the squared values. Let U = X2²/10, and V = X3²/10.
The regression equation then may be written

$$X_1 = a + b_2 X_2 + b_u U + b_3 X_3 + b_v V$$

The normal equations to determine the four constants b2, bu, b3, and bv
are next obtained in exactly the same manner as described in Chapter 12
for a multiple correlation involving four variables. The resulting
normal equations are:

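$$
\begin{aligned}
(\Sigma x_2^2)\,b_2 + (\Sigma x_2 u)\,b_u + (\Sigma x_2 x_3)\,b_3 + (\Sigma x_2 v)\,b_v &= \Sigma x_1 x_2 \\
(\Sigma x_2 u)\,b_2 + (\Sigma u^2)\,b_u + (\Sigma x_3 u)\,b_3 + (\Sigma u v)\,b_v &= \Sigma x_1 u \\
(\Sigma x_2 x_3)\,b_2 + (\Sigma u x_3)\,b_u + (\Sigma x_3^2)\,b_3 + (\Sigma x_3 v)\,b_v &= \Sigma x_1 x_3 \\
(\Sigma x_2 v)\,b_2 + (\Sigma u v)\,b_u + (\Sigma x_3 v)\,b_3 + (\Sigma v^2)\,b_v &= \Sigma x_1 v
\end{aligned}
$$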

Carrying out the required computations, the equations are found
to be:

$$
\begin{aligned}
171.750\,b_2 + 170.625\,b_u + 165.000\,b_3 + 207.600\,b_v &= -2.50 \\
170.625\,b_2 + 181.165\,b_u + 153.540\,b_3 + 192.316\,b_v &= -5.31 \\
165.000\,b_2 + 153.540\,b_u + 295.200\,b_3 + 441.480\,b_v &= -50.80 \\
207.600\,b_2 + 192.316\,b_u + 441.480\,b_3 + 696.072\,b_v &= -86.52
\end{aligned}
$$

The (Σx1²) = 24.20.

Solving these equations by the usual method and computing a from
equation (41) by restating it

$$a_{1.2u3v} = M_1 - b_2 M_2 - b_u M_u - b_3 M_3 - b_v M_v$$

we find the regression equation to be

$$X_1 = 9.411 + 1.2709\,X_2 - 0.7337\,U - 0.9957\,X_3 + 0.3309\,V$$

The net regressions of X1 on X2 and X3 are now shown by the two
parabolic equations:

$$X_1 = 5.596 + 1.2709\,X_2 - 0.07337\,X_2^2$$

$$X_1 = 12.515 - 0.9957\,X_3 + 0.03309\,X_3^2$$

The graph of these two curves is shown in Figure 71.
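
The whole computation can be retraced mechanically from the twenty observations tabulated above. The sketch below forms the sums of squares and products in deviations from the means, solves the four normal equations by elimination, and computes the unadjusted multiple correlation. The helper names `deviations` and `solve` are invented here, and elimination on a computer is only a stand-in for the hand method of solution used in the text.

```python
# (X2, X3, X1) for the twenty observations, read row by row.
DATA = [
    (1, 3, 8), (0, 2, 7), (2, 2, 10), (4, 5, 9), (4, 7, 8),
    (3, 3, 10), (9, 8, 9), (1, 2, 9), (5, 5, 10), (6, 5, 10),
    (2, 3, 9), (1, 2, 9), (2, 2, 9), (2, 2, 10), (7, 14, 7),
    (4, 14, 6), (9, 8, 9), (1, 2, 9), (2, 4, 8), (10, 11, 8),
]


def deviations(values):
    """Return the values as departures from their mean."""
    mean = sum(values) / len(values)
    return [value - mean for value in values]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [p - factor * q for p, q in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


x2 = [row[0] for row in DATA]
x3 = [row[1] for row in DATA]
x1 = [row[2] for row in DATA]
u_coded = [value ** 2 / 10 for value in x2]   # U = X2 squared / 10
v_coded = [value ** 2 / 10 for value in x3]   # V = X3 squared / 10

# Order of the independent variables: X2, U, X3, V.
columns = [deviations(c) for c in (x2, u_coded, x3, v_coded)]
y = deviations(x1)

normal = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns]
          for ci in columns]
rhs = [sum(p * q for p, q in zip(ci, y)) for ci in columns]

b = solve(normal, rhs)   # b2, bu, b3, bv
explained = sum(bi * ri for bi, ri in zip(b, rhs))
r_multiple = (explained / sum(d * d for d in y)) ** 0.5

print([round(value, 4) for value in b], round(r_multiple, 3))
```

The sums agree with the coefficients of the normal equations as printed, and the solution agrees with the regression coefficients given above to within the rounding of hand computation.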




The (unadjusted) multiple correlation of X1 with X2, U, X3, and V
is 0.968. Since there are five constants in the regression equation, and
20 observations, this gives (by equation [47]) an adjusted
correlation of 0.963. This is then the index of multiple correlation of X1 with
X2 and X3, according to the parabolic regressions. The standard error
of estimate, similarly adjusted, is found to be 0.101.

It should be noted that where net curvilinear regressions are found 
by this method, the number of constants assumed in the regression 



Fig. 71. Parabolic regression curves, fitted simultaneously, and net residuals.

equation is definitely known, and there can be no question as to the 
exact correction to apply to the computed correlation and standard 
error, or as to the probable significance of the observed correlation. 
On the other hand, the shape of the regression curve obtained is
conditioned by the type of curve assumed; except where there is some 
logical basis for assuming a particular type of relation, the selection 
of the formula is still a purely empirical process. If the formula se- 
lected does not fit the data well, the resulting curves may fail to reveal 
the true relations. 






Testing the fit of the curves graphically. The extent to which
mathematical net regressions fail to fit the data adequately may be
investigated, in any particular problem, by the same graphic methods
set forth in Chapter 14. To make this check, after the regression
equation is determined for the particular curves selected, estimated
values of $X_1$ are calculated from the equation. The residual differences
between $X_1$ and these estimated values are then computed. These
residuals are then plotted as departures from the mathematical net
regression curves, in the same manner that the residuals from linear
regressions were plotted as departures from the linear regressions
in Chapter 14. Carrying this out for the problem illustrated, we obtain
these results:


   X2    X3    X1    X1'     z    |    X2    X3    X1    X1'     z

    1     3     8    7.9    0.1   |     0     2     7    6.7    0.3
    2     2    10    9.8    0.2   |     4     5     9    9.2   -0.2
    4     7     8    8.0    0.0   |     3     3    10    9.9    0.1
    9     8     9    9.1   -0.1   |     1     2     9    8.8    0.2
    5     5    10    9.8    0.2   |     6     5    10   10.2   -0.2
    2     3     9    9.0    0.0   |     1     2     9    8.8    0.2
    2     2     9    9.8   -0.8   |     2     2    10    9.8    0.2
    7    14     7    7.2   -0.2   |     4    14     6    5.9    0.1
    9     8     9    9.1   -0.1   |     1     2     9    8.8    0.2
    2     4     8    8.2   -0.2   |    10    11     8    7.8    0.2


The residuals obtained above are then plotted as departures from
the parabolic net regressions, as also shown in Figure 71. It is evident
in this case that the parabolic regressions represent the relations quite
well, with the departures in general evenly distributed on both sides of
each curve throughout their length. Only in the curve for $X_1 = f_2(X_2)$
is there any indication of failure to obtain a good fit. Here most of the
individual observations lie slightly above the curve for values of $X_2$
below 2. Above $X_2 = 8$ the individual observations do not agree with
the downward turn of the parabola. Using a third-degree parabola
for $f_2(X_2)$, which would mean adding a new term, $b_2''(X_2^3)$, to the
regression equation, would produce a better fit for this function.
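The computation of estimated values and residuals can be illustrated with a short sketch. It rests on an assumption that is not stated in this passage: that the auxiliary variables stand for $U = X_2^2/10$ and $V = X_3^2/10$. That scaling reproduces the tabulated estimates for the rows tried here, but it should be checked against the definition given earlier in the chapter. The function name `estimate_x1` is hypothetical.

```python
def estimate_x1(x2, x3):
    """Estimated X1 from the fitted parabolic regression equation.

    Assumption (not stated in this passage): U = X2**2 / 10 and
    V = X3**2 / 10 are the squared terms entered in the equation.
    """
    u = x2 ** 2 / 10
    v = x3 ** 2 / 10
    return 9.411 + 1.2709 * x2 - 0.7337 * u - 0.9957 * x3 + 0.3309 * v


# Three (X2, X3, X1) observations taken from the table above.
observations = [(2, 2, 10), (9, 8, 9), (4, 14, 6)]

for x2, x3, x1 in observations:
    estimate = round(estimate_x1(x2, x3), 1)
    residual = round(x1 - estimate, 1)
    print(x2, x3, x1, estimate, residual)
```

Each residual is then plotted against $X_2$ and against $X_3$ as a departure from the corresponding net regression curve.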

Where the graphic check on the adequacy of a mathematical net
regression curve indicated that the functional relation was such that
it could not be readily represented by a higher-order parabola or
other simple mathematical expression, a freehand curve might be
fitted to the residuals instead, and the final shape of the curves
determined by successive approximations, just as described in Chapter 14.
The determination of parabolic net regressions may thus be substituted
for the determination of linear net regressions as the first step
in the successive approximation method of obtaining net regression
curves.

Any other type of mathematical function, the parameters of which
can be expressed in the first degree, can be used to determine net
regressions by the method of least squares. Besides still higher powers
of $X_2$, such transformations as $10/X_2$, $100/X_2^2$, $\log X_2$, and
$1/\log X_2$ may be employed as independent variables, either in place of
the previous independent variables or as an addition to the simple
statement of them.
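The use of a transformed variable may be illustrated by a minimal sketch. The function `simple_regression` and the five observations are hypothetical; the data are built exactly from $X_1 = 3 + 2 \log X_2$, so that regressing $X_1$ on $\log X_2$ in place of $X_2$ should return those two constants.

```python
import math


def simple_regression(x, y):
    """Least-squares constants (a, b) of y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    return mean_y - b * mean_x, b


# Hypothetical data following a logarithmic relation exactly.
x2 = [1, 2, 4, 8, 16]
x1 = [3 + 2 * math.log10(v) for v in x2]

# The transformed values, not X2 itself, serve as the independent variable.
log_x2 = [math.log10(v) for v in x2]
a, b = simple_regression(log_x2, x1)
print(round(a, 6), round(b, 6))
```

The relation is curvilinear in $X_2$ but linear in its constants, which is all that the method of least squares requires.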

Supplementary methods of determining the final shape of net
regression curves. After the shapes of the several regression curves
have been determined by the method of successive approximations,
it is sometimes desirable to use the method of linear correlation to
determine whether any further adjustment should be made in the
slope of the several curves, to give the closest possible estimate of the
$X_1$ values. There are two alternative ways of doing this, yielding
slightly different types of corrections. The first and simplest method
is to correlate the final residuals with the values of the several
independent variables. That is, a new multiple correlation is run to
determine the regression equation

$$z'''' = a_{z.234} + b_{z2.34}X_2 + b_{z3.24}X_3 + b_{z4.23}X_4 \qquad (88)$$

If significant values are obtained for any of the b's, they indicate
that the corresponding regression curve should be rotated
counterclockwise, if the net regression coefficient is positive; or
clockwise if the coefficient is negative. The final values of the several
functions will then equal the readings for the curves as previously
determined, plus the additional linear correction. That is, if the final
curvilinear multiple regression equation is to be

$$X_1 = k + f_2(X_2) + f_3(X_3) + f_4(X_4)$$

the several terms will be:

$$k = a''_{1.234} + a_{z.234}$$

$$f_2(X_2) = f_2''(X_2) + b_{z2.34}X_2$$

$$f_3(X_3) = f_3''(X_3) + b_{z3.24}X_3$$

$$f_4(X_4) = f_4''(X_4) + b_{z4.23}X_4$$




Since the intercorrelations between X2, X3, and X4 have already 
been computed in determining the original linear net regressions, much 
of the work required in determining the constants for equation (88) 
has already been performed, and the additional computation involved 
is not very heavy. 

A somewhat different type of correction is obtained by determining
the regression equation

$$X_1 = a_{1.2'3'4'} + b_{12'.3'4'}[f_2''(X_2)] + b_{13'.2'4'}[f_3''(X_3)] + b_{14'.2'3'}[f_4''(X_4)] \qquad (89)$$

To compute the new constants required in equation (89), the functional
readings corresponding to the independent variables are correlated with
the original values of the dependent variable. Thus, if the values in
Table 64, page 246, had been obtained from the final curves determined
by the successive approximation process, the values read from the
curves, shown in the fourth, fifth, and sixth columns, would have been
substituted for the original independent variables in running the
multiple correlation with $X_1$. If $X_2'$, $X_3'$, etc., are used to
represent these transformed values, the data to be correlated for the
first four sets of observations would be:

   X2'     X3'     X4'     X1

   7.4    11.7    12.3    24.5
   8.4    12.2    12.2    27.9
   7.9    13.0    11.8    33.7
   8.8     9.9    12.2    27.5


If the net regression coefficients come out 1.0, in equation (89), that 
indicates that no change need be made in the curves. If any b comes 
out other than unity, however, the values read from the corresponding 
curve should be adjusted as indicated by the regression results. The 
adjustment may be worked out as follows: 

In the same way that $f_2''(X_2)$ was used to indicate the values read
from the final set of approximation curves, let $f_2''(x_2)$ represent the
deviations of those readings for each variable from the average of all
the readings for the particular variable. That is, for each observation

$$f_2''(x_2) = f_2''(X_2) - \overline{f_2''(X_2)}$$

The regression equation (89) may then be restated

$$x_1 = b_{12'.3'4'}[f_2''(x_2)] + b_{13'.2'4'}[f_3''(x_3)] + b_{14'.2'3'}[f_4''(x_4)]$$

and the corrected functions will be as follows:

$$f_2(x_2) = b_{12'.3'4'}[f_2''(x_2)]$$

$$f_3(x_3) = b_{13'.2'4'}[f_3''(x_3)]$$

$$f_4(x_4) = b_{14'.2'3'}[f_4''(x_4)]$$
The difference between the two types of corrections is illustrated in
Figure 72. Here the final curve for $f_4(X_4)$, from the corn-yield
problem, has been plotted, and in addition it is shown as if a correction
$+0.5x_4$ had been worked out by the first method, equation (88), or a
correction $1.5[f_4''(x_4)]$ had been determined by the second method,
equation (89). It is evident that the first correction rotates the curve,
so as to make its upward slope greater throughout, and its downward
slope less; whereas the second correction merely expands the curve,
making all the high values higher and all the low values lower, no
matter where they fall with respect to $X_4$. This is typical of the
effect of these two types of corrections when applied to a curve of the
type shown here. For curves which do not depart so far from a straight
line and which either rise or fall through their entire length, the
difference between the two types of correction is less marked, as may
readily be determined by experiment. For a straight line the correction
given by the two methods will tend to be identical.
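The contrast between the two corrections can be seen numerically in a small sketch. The curve readings, the function names `rotate` and `expand`, and the choice to add the mean reading back after expansion are all assumptions made for illustration; the coefficients 0.5 and 1.5 are the ones used in the discussion of Figure 72.

```python
def mean(values):
    return sum(values) / len(values)


def rotate(xs, readings, b):
    """First type of correction: add b times the deviation of X from its mean."""
    mean_x = mean(xs)
    return [f + b * (x - mean_x) for x, f in zip(xs, readings)]


def expand(readings, b):
    """Second type of correction: multiply each reading's deviation from the
    average reading by b (the average is added back for comparison)."""
    mean_f = mean(readings)
    return [mean_f + b * (f - mean_f) for f in readings]


# Hypothetical readings from a curve that rises and then falls.
xs = [0, 1, 2, 3, 4]
readings = [1.0, 3.0, 4.0, 3.0, 1.0]

rotated = rotate(xs, readings, 0.5)
expanded = expand(readings, 1.5)
print(rotated)
print([round(v, 2) for v in expanded])
```

The rotated curve is no longer symmetrical, its rise being steeper and its fall more gradual, whereas the expanded curve keeps its symmetry and merely has its high readings raised and its low readings lowered.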

Fig. 72. Two types of corrections to net regression curves.

The Bruce adjustment. Besides the two methods shown of adjusting
the final curves by linear correlation (the method of least squares),
there is a somewhat different adjustment of the final readings, termed
the Bruce method after its originator.¹ This method consists
essentially of (1) constructing a dot chart showing the relation between
the original values of $X_1$ and the values, $X_1'''$, estimated from the
final set of curves (even after corrections such as those just mentioned
have been applied); (2) drawing in a curve showing average values of
$X_1$ for corresponding values of $X_1'''$; and (3) using this curve as the
basis for making the final estimate. This method thus consists in
finding the function $\theta$ for the equation

$$X_1'''' = \theta(X_1''') \qquad (90)$$

$$X_1'''' = \theta[a + f_2''(X_2) + f_3''(X_3) + f_4''(X_4)] \qquad (91)$$

The function $\theta$ corresponds to the curve just described.

¹ See list of references at end of this chapter.

The curve for $\theta$ can be determined only after all the other f's are
worked out. The average value of $X_1$ is determined for each group of
estimated values, $X_1'''$, treating $X_1'''$ as if it were a single
independent variable. Plotting the average values of both variables
against each other, just as in Figure 23, a curve may be fitted freehand
if it is indicated, to show the change in $X_1$ with changes in $X_1'''$.
This curve is then the function $\theta$ for equation (90).

This function may then be used to work out new estimates, $X_1''''$,
which should have still higher correlation with $X_1$ than the previous
estimates.
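The grouping and averaging involved may be sketched as follows. The six pairs of estimated and actual values, the group limits, and the function names `bruce_curve` and `theta` are hypothetical. Straight-line interpolation between the group averages is used here only as a stand-in for the freehand curve, which cannot be reproduced in code.

```python
def bruce_curve(estimates, actuals, limits):
    """Average the estimates and the actual values within each group of estimates.

    Returns one (average estimate, average actual) point per non-empty group.
    """
    points = []
    for low, high in zip(limits[:-1], limits[1:]):
        group = [(e, a) for e, a in zip(estimates, actuals) if low <= e < high]
        if group:
            points.append(
                (
                    sum(e for e, _ in group) / len(group),
                    sum(a for _, a in group) / len(group),
                )
            )
    return points


def theta(points, estimate):
    """Read the adjusted estimate from the curve through the group averages.

    Assumption: linear interpolation between points, flat beyond the ends.
    """
    if estimate <= points[0][0]:
        return points[0][1]
    if estimate >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= estimate <= x1:
            return y0 + (y1 - y0) * (estimate - x0) / (x1 - x0)


# Hypothetical estimates from the final curves, and the actual values.
estimates = [10, 12, 20, 22, 30, 32]
actuals = [8, 10, 20, 22, 33, 35]

points = bruce_curve(estimates, actuals, limits=[0, 15, 25, 35])
print(points)
print(theta(points, 26))
```

In this made-up case the low estimates run too high and the high estimates too low, and the curve through the group averages corrects both.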

The Bruce adjustment is likely to prove of value in certain types
of joint functions. Thus in the egg-price problem, it might be that
when all the values of several factors, each of which by itself tended
to lower the price, occurred in combination, the resulting price would
be, on the average, even lower than the sum of the effects of each of
the variables would indicate. On the other hand, it might be that
when values of several factors, each of which would raise the price,
occurred together, the price would not go quite as high as the sum
of the probable effects of the several factors would indicate. The
Bruce adjustment thus makes it possible to determine one type of
joint relation without the considerable extra work described in
Chapter 21 for determining joint functions in general.

Determining joint relations by contours. The method for determining
joint relations presented in the last chapter is essentially one of
subclassification and then two-way smoothing of the resulting averages,
by successive smoothing for each of the two (or more)
independent variables. A somewhat different method has been worked
out by which a three-variable surface may be smoothed directly in
both independent dimensions at the same time. The Waugh method
is based on determining contours directly, instead of indirectly as
described in the last chapter. In using this method, the averages of
subgroups (either of original observations, as in Table 81, or of
residuals) are plotted directly on a two-variable diagram, with
one independent variable as ordinate and the other independent
variable as abscissa, and with the group averages used for the two
independent variables. The average of the dependent variable (or
residual) is then written in next to the dot which designates the
subgroup. Figure 73 shows such a chart for the averages of Table 81.
The next step is to connect averages of equal values by a continuous
line, or, if none are the same, to run in contour lines which will enclose
averages within the same limits. Thus the lines on the chart have
been drawn so as to separate off the groups with $X_1$ under 0.300,
between 0.300 and 0.400, from 0.400 to 0.500, etc. Once the shape
and direction of these contours are determined, they may then be
redrawn so as to keep a similar shape or a continuously changing
shape, and an even or a regularly changing interval, across the whole
surface. It is evident that Figure 73 is quite similar to Figure 69,
determined by the other method.

Fig. 73. Average values of $X_1$ for various combinations of $X_2$ and $X_3$,
and contours fitted directly to the data.

Where the correlation is high, so that the individual observations
define the regression surface rather closely, the Waugh method may
be used directly with the individual observations, plotting each
observation in the same way that the group averages were plotted in
Figure 73. The following data illustrate this use of the method.







The data from Table 84 are plotted in Figure 74, with the yield
adjusted for trend used as the dependent factor. Drawing in contours
so as to separate years of similar yields, we find that a very
peculiar type of surface is indicated, one that changes elevation very
rapidly between the combination of high early rainfall and low late
rainfall, and high early rainfall and high late rainfall. When these
results are used to forecast the yield in 1928 (which year, it will be
noted, was not plotted or used in determining the contours) a yield
of about 175 bushels is indicated. This is only in fair agreement
with the final yield of 219 bushels, determined several months after
the climatic data were available to give the forecast stated.

Fig. 74. Yield of potatoes for years of specified rainfall before August 1 and after
August 1, and contours fitted directly to the data.

Reading off the estimated values for each year shown, the estimated
adjusted yields as shown in the next to the last column of
Table 84 are obtained. The standard deviation of the residuals, shown
in the next column, is 10.6 bushels, whereas the $\sigma$ of the yield adjusted
for trend is 63.0. If five constants are assumed to be necessary
to represent the surface mathematically, the standard error of estimate
would be 13.0 bushels and the index of correlation for the surface
indicated by the contours would be 0.98. If it is assumed that the trend
line could be fairly accurately projected, the standard error of estimate
indicates that an error as great as that in 1928 would be likely
to occur only very rarely.² The fact of high correlation and of low
standard error could be judged directly from the closeness with which
the contours fit the individual observations, in just the same way that
closeness of the observations to the regression line indicates high
correlation in the case of simple correlation.

TABLE 84

Weather Conditions and Yield of Potatoes in Maine

        Rainfall to  Rainfall                            Yield
        August 1     August 1 to             Adjustment  adjusted   Estimated
Year    (July        September 15  Yield     for trend*  for trend  yield       Residual
        doubled)
        Inches       Inches        Bushels   Bushels     Bushels    Bushels
        X2           X3            X1                    X1         f(X2, X3)   z

1913    13.17        3.66          220       +26         246        248          -2
1914    11.33        4.08          260       +27         287        260          27
1915    15.96        4.12          179       +31         210        229         -19
1916    15.46        3.77          204       +33         237        236           1
1917    17.77        5.53          125       +31         156        155           1
1918    18.09        3.87          200       +22         222        220           2
1919    12.25        5.41          230       +17         247        248          -1
1920    13.29        7.62          177       +15         192        196          -4
1921     7.82        6.11          298       +13         311        323         -12
1922    16.40        5.12          187       +12         199        197           2
1923    10.61        3.51          258        +9         267        278         -11
1924     9.10        6.13          315        +7         322        308          14
1925    11.30        5.38          250        +5         255        262          -7
1926     9.60        5.00          290        +3         293        297          -4
1927    13.98        6.02          232        +1         233        226           7
1928    15.45        6.45          220        -1         219

* Simultaneously determined, allowing for trend. See F. V. Waugh, Methods of
forecasting New England potato yields, U. S. Department of Agriculture, Bureau of
Agricultural Economics, Mimeographed Report, February, 1929.
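The figures quoted for the potato problem can be checked with a short computation. The sketch assumes, since the formulas are not restated in this passage, that the adjusted standard error of estimate is obtained by multiplying the residual variance by $n/(n - m)$, and that the index of correlation is taken as $\sqrt{1 - S^2/\sigma_1^2}$ with $\sigma_1 = 63.0$ as given in the text. The residuals are those of Table 84 for 1913 to 1927.

```python
import math

# Residuals z from Table 84, 1913 through 1927 (1928 was not used).
residuals = [-2, 27, -19, 1, 1, 2, -1, -4, -12, 2, -11, 14, -7, -4, 7]

n = len(residuals)   # number of observations
m = 5                # constants assumed to represent the surface
sigma_1 = 63.0       # standard deviation of adjusted yield, as quoted in the text

# Standard deviation of the residuals (root mean square).
sd_residuals = math.sqrt(sum(z * z for z in residuals) / n)

# Assumed adjustment for the number of constants: variance times n / (n - m).
standard_error = sd_residuals * math.sqrt(n / (n - m))

# Assumed form of the index of correlation.
index_of_correlation = math.sqrt(1 - standard_error ** 2 / sigma_1 ** 2)

print(round(sd_residuals, 1), round(standard_error, 1), round(index_of_correlation, 2))
```

Under these assumptions the computation reproduces the 10.6 bushels, 13.0 bushels, and 0.98 quoted above.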

² If the standard error of this particular estimate could be calculated along the
lines indicated in Chapter 19, the error might not appear so unusual.

Determining joint functions by definite mathematical functions.
In exactly the same way that definite equations can be determined
by the method of least squares to represent curvilinear net
regressions, certain types of joint functional surfaces can be represented
by definite equations. The simplest type is that shown by the haystack
volume problem in Chapter 21, where the regression of $X_1$ on $X_3$ is
substantially linear for any given value of $X_2$ but where the slope of
the regression $b_{13.2}$ changes as the values of $X_2$ change. If it is
assumed that the slope $b_{13.2}$ changes at a constant rate with changes
in $X_2$, this assumption may be expressed in the relation

$$X_1 = a + b(c + dX_2)X_3$$

Multiplied out, it becomes

$$X_1 = a + bcX_3 + bdX_2X_3$$

which may be stated

$$X_1 = a + eX_3 + g(X_2X_3) \qquad (92)$$

The values of a, e, and g may then be determined by the usual methods
of linear multiple correlation, with $X_3$ and the values of the product
$(X_2X_3)$ used as the independent factors.

If it is assumed that $X_1$ varies with $X_2$, other than through its
influence on $b_{13.2}$, an additional term may be added to the equation,
making it

$$X_1 = a + eX_3 + g(X_2X_3) + hX_2 \qquad (93)$$

Determining the values of the four constants of equation (93) from
the haystack data, and working out estimated values of $X_1$ for specific
combinations of values of $X_2$ and $X_3$, we shall arrive at the same joint
functional surface as was determined by the graphic method presented
in Chapter 21.
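The fitting of equation (92) can be sketched with a small example. The nine observations are hypothetical (they are not the haystack data) and are built exactly from $a = 2$, $e = 0.5$, $g = 0.25$; the helper `least_squares` is likewise an assumption of this sketch. The product $X_2X_3$ is simply entered as a second independent factor alongside $X_3$.

```python
def least_squares(rows, y):
    """Solve the normal equations (Z'Z) b = Z'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(k):
            if r != c:
                factor = aug[r][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    return [row[k] for row in aug]


# Hypothetical observations following X1 = 2 + 0.5*X3 + 0.25*(X2*X3) exactly.
x2 = [1, 2, 3, 1, 2, 3, 1, 2, 3]
x3 = [1, 1, 1, 2, 2, 2, 3, 3, 3]
x1 = [2 + 0.5 * v + 0.25 * u * v for u, v in zip(x2, x3)]

# Independent factors: a constant, X3, and the product X2*X3.
rows = [[1.0, v, u * v] for u, v in zip(x2, x3)]
a, e, g = least_squares(rows, x1)


def slope_on_x3(x2_value):
    """Slope of X1 on X3 for a given value of X2, which is e + g*X2."""
    return e + g * x2_value


print(round(a, 6), round(e, 6), round(g, 6), round(slope_on_x3(2), 6))
```

The slope of $X_1$ on $X_3$ is not a single constant here but changes uniformly with $X_2$, which is the property the product term is meant to capture.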

We may extend the same method to n independent variables,
assuming similar linear net regressions for $X_1$ on each independent
factor, with the other independent factors constant at any given values
and with these net regressions changing their slope progressively and
uniformly as the other independent factors change. For three
independent factors (four-dimensional space) the regression equation
would be

$$X_1 = a + b_2X_2 + b_3X_3 + b_4X_4 + c_2(X_2X_3) + c_3(X_2X_4) + c_4(X_3X_4)$$

Determination of the seven constants would thus make possible a
definite mathematical representation of a very complex set of
relationships.
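The way the number of constants grows with the number of independent factors can be shown by listing the terms. The function `interaction_terms` and the numerical values are hypothetical; the sketch only enumerates the constant term, the factors themselves, and every product of two factors, as in the equation above.

```python
from itertools import combinations


def interaction_terms(values):
    """List the (label, value) terms of the equation for one observation.

    `values` maps each factor's subscript to its value, e.g. {2: x2, 3: x3}.
    """
    keys = sorted(values)
    terms = [("a", 1.0)]
    terms += [(f"X{k}", values[k]) for k in keys]
    terms += [
        (f"X{i}X{j}", values[i] * values[j]) for i, j in combinations(keys, 2)
    ]
    return terms


# One hypothetical observation with three independent factors.
terms = interaction_terms({2: 3.0, 3: 5.0, 4: 2.0})
for label, value in terms:
    print(label, value)
print(len(terms), "constants to be determined")
```

For k independent factors the count is 1 + k + k(k - 1)/2, which gives the seven constants mentioned above when k is 3.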





If it is assumed (1) that the regression of $X_1$ on $X_2$, for any given
value of $X_3$, is a curve and (2) that the slope of this curve changes at
a changing rate as $X_3$ changes, this assumption may be stated

$$X_1 = a + f[a + e(X_3)]X_2$$

This equation may be approximately represented by the following
form:

$$X_1 = a + f_2(X_2) + f_{2.3}(X_2X_3) + f_3(X_3) \qquad (94)$$

Using $X_2$, $X_3$, and the product $(X_2X_3)$ as the independent factors, we
may determine the shape of the three functions by any of the methods
presented previously. Then working out estimated values of $X_1$ for
various combinations of $X_2$ and $X_3$, we can determine very warped
curvilinear surfaces for $f(X_2, X_3)$. This last method is extremely
flexible, and can be used to determine a wide variety of joint
functional relations. It, too, may be generalized for n variables, with
increasing numbers of observations. For three independent variables
it would be

$$X_1 = a + f_2(X_2) + f_3(X_3) + f_4(X_4) + f_{2.3}(X_2X_3) + f_{2.4}(X_2X_4) + f_{3.4}(X_3X_4)$$

Although these methods do not reduce greatly the number of
observations required to determine joint functions, they do make it
possible to apply the systematic procedure developed in Chapter 14
and to judge more accurately the number of constants represented
in the regression surface; and they enable the methods of Chapters 18
and 19 to be applied in judging the reliability of the conclusions.

The Court method. The Waugh method is essentially a way of
simplifying the smoothing of the surface, while still leaving it primarily
a graphic freehand process. Another method, developed by Andrew
Court, reduces the determination of joint functions to a more definite
process, similar to the determination of the usual regression curves.
This method depends upon a mathematical rotation of the surface of
cubes such as those shown in Figures 62 to 64, so that, instead of
averaging the values only when viewed with respect to the rectangular
axes $X_2$ and $X_3$, we may also average them with respect to axes
cutting across the surface at an angle. Though similar to the
mathematical method just described, this method is applicable to a
somewhat different type of surface.

The characteristic feature of the method is the use of composite
functions which represent two or more independent variables. Thus
the regression surface in Figure 70, for apples of one size, might be
expressed by the equation

$$X_1 = f_2(X_2) + f_3(X_3) + f_{2+3}(X_2 + X_3) \qquad (95)$$

The effect of the introduction of the new composite element
$(X_2 + X_3)$ may be explained by working out what the values of this
composite variable will be for various combinations of $X_2$ and $X_3$.
The following statement shows this in detail.


Value of Composite Variable (X2 + X3), for Various Values of X2 and X3

                          X2 values
X3 values       0     20     40     60     80    100

     0          0     20     40     60     80    100
    20         20     40     60     80    100    120
    40         40     60     80    100    120    140
    60         60     80    100    120    140    160
    80         80    100    120    140    160    180
   100        100    120    140    160    180    200


It will be seen that the composite values $(X_2 + X_3)$ run diagonally
across the surface. Thus the value 100 occurs with $X_2 = 100$, $X_3 = 0$;
with $X_2 = 50$, $X_3 = 50$; and with $X_2 = 0$, $X_3 = 100$. If the surface
shown in Figure 70 were to be described by equation (95), the curve
for $f_{2+3}(X_2 + X_3)$ would rise gradually from 0 to 100, then rise more
and more sharply as it approached 200.
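The tabular statement can be generated, and its diagonal pattern confirmed, with a few lines of code. The names `levels`, `table`, and `combinations_with_sum` are hypothetical; the values of $X_2$ and $X_3$ are the six levels used in the table above.

```python
levels = [0, 20, 40, 60, 80, 100]

# One row of composite values (X2 + X3) for each value of X3.
table = {x3: [x2 + x3 for x2 in levels] for x3 in levels}


def combinations_with_sum(total):
    """All tabulated (X2, X3) pairs whose composite value equals `total`."""
    return [(x2, total - x2) for x2 in levels if (total - x2) in levels]


for x3 in levels:
    print(x3, table[x3])

print(combinations_with_sum(100))
```

Every pair returned for a given total lies on one diagonal of the table, so that a single reading from the composite curve applies to all of them alike.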

One advantage of the Court method is that it makes it possible
to estimate with much greater accuracy the number of constants
represented by the surface. Thus in Figure 70 each curve might
reasonably be represented by a second-degree parabola, so seven constants
may be assumed for the entire relation in equation (95). If desired, we
could write the equation in terms of parabolas, as follows:

$$X_1 = a + b_2X_2 + b_2'(X_2^2) + b_3X_3 + b_3'(X_3^2) + b_4(X_2 + X_3) + b_4'(X_2 + X_3)^2$$

Stated in this form, the shape of the surface could be determined by a
least-squares solution, giving exactly determinable shapes for each of
the three functions, and a definite measure of the reliability of the
results. However, unless the mathematical curves happened to be
about right to represent the real relations, the final functions might
not express the relationship so closely as would the freehand curves,
determined by successive approximations.

Where two independent variables which are to be considered jointly
are not of the same degree of variability, a 45-degree rotation of the
surface, such as that shown in the tabular statement of $(X_2 + X_3)$,
could still be secured by making the composite variable equal to
$X_2/\sigma_2 + X_3/\sigma_3$. If a negative rotation were desired, that could
be obtained by using the value $X_2/\sigma_2 - X_3/\sigma_3$. Further, if it
were desired to rotate the surface either less or more than 45 degrees,
that could be done by dividing one variable or the other by a suitable
constant. Thus the
form $X_2/\sigma_2 + 2X_3/\sigma_3$ would rotate the surface about 67 degrees.


The general statement for the Court solution, for a two-variable
joint function, is:

$$X_1 = f_2(X_2) + f_3(X_3) + f_{2+3} + f_{2-3} \qquad (96)$$

Even a more complex form than equation (96) could be employed,
by using combination functions of several different degrees of rotation
in the same equation. Using such a combination with simple parabolas
for each function, Court has successfully fitted the regression
surface shown in Figure 63, illustrating the flexibility of the method.
It is evident, however, that much judgment is necessary in selecting
the way the combination variable or variables are to be stated in
equation (96), both with respect to whether the rotation is to be
positive, or negative, or both, and the extent of the rotation to be
used. In the apple-price problem, it was known that the statement
of equation (95) would fit quite well, because of the prior knowledge
of the relations expressed in Figure 70. Where such information
as to the shape of the function is not known a priori, considerable
testing of different methods of statement and examination of group
averages and profile charts like Figure 65 would be necessary to
decide upon a form of statement which would yield adequate results.

The Court method may be extended to n-dimension joint functions,
and it has very great flexibility for this purpose. The number
of possible combination variables becomes increasingly great as the
number of variables increases, however, so that stable results cannot
be secured by this method either, unless a sufficiently large number of
observations is available to define the relations for each of the
subclasses which are obtained by successive sorting on each independent
variable. Thus using only 45-degree rotations, we should find the full
Court equation for three independent variables to be

$$X_1 = f_2(X_2) + f_3(X_3) + f_4(X_4) + \cdots \qquad (97)$$

Even if each function were represented by only two constants,
equation (97) would involve fourteen constants. The similar forms for
four and five independent variables become increasingly complex.
It is probable that this method can be used only occasionally, where
a very large number of observations can be obtained. For such
problems, however (as for the apple-price example, where 190
observations were available), the use of equations such as (96) or (97)
might reduce somewhat the factor of individual judgment and enable the
researcher to determine joint relations in n-dimensional space with
more facility than by graphic methods which essentially involve
considering individual dimensions in succession, or at most two at the
same time.

Measures of correlation for mathematically determined regressions.
Where the curvilinear net regressions or regression surfaces have been
determined by the use of mathematical functions such as those
indicated in equations (55) and (56), then the several measures of
closeness of fit can be obtained from the computations employed in
determining the values of the several b's by the usual linear multiple
correlation methods. For example, if equation (56) involving cubic
parabolas for each variable has been employed, the regression equation
is

$$X_1 = a + b_2X_2 + b_2'(X_2)^2 + b_2''(X_2)^3 + b_3X_3 + b_3'(X_3)^2 + b_3''(X_3)^3 + \text{(etc.)}$$

In that case the coefficient of multiple correlation with respect to the
independent variables $X_2$, $X_2^2$, $X_2^3$, $X_3$, $X_3^2$, $X_3^3$, etc.,
becomes the index of multiple correlation with respect to the variables
$X_2$, $X_3$, etc. The necessary adjustments because of the number of
constants represented in the regression equation, as indicated by
equation (67), still have to be made, of course. With the regression
curves determined mathematically, there is no question of the value of
m to be used. For the equation shown above, with only two independent
variables, $X_2$ and $X_3$, m is 7. With this limitation, the index of
multiple correlation for mathematically determined regressions may be
defined by the equation

$$R_{1.2,\,2^2,\,2^3,\,3,\,3^2,\,3^3,\,\ldots,\,n,\,n^2,\,n^3} = P_{1.23 \ldots n} \qquad (98)$$

Indexes of partial correlation could be worked out by parallel
recombination of the elements involved in determining the constants
of equation (56), but the steps necessary would become exceedingly
complicated, and therefore are not set forth here.

The standard error of estimate in using a mathematically determined
curvilinear regression equation is the same as the standard
error of the multiple correlation results, with the appropriate
correction for the number of constants. When the index of multiple
correlation has been determined, with the proper adjustments, the
standard error of estimate may readily be obtained by the formula

$$S^2_{1.f(23 \ldots n)} = \sigma_1^2\,(1 - P^2_{1.23 \ldots n}) \qquad (99)$$

This operation is necessarily identical with that employed in
computing the standard error in linear multiple correlation, using the
adjusted coefficient of multiple correlation.

Differential regressions. The relation of rainfall or temperature to
a growing crop can be measured more effectively if the distribution of
rainfall or temperature through the entire season is considered, instead
of breaking up the records into a series of arbitrary periods as in
various illustrations to this point. Many years ago R. A. Fisher developed
a method of fitting a continuous differential regression curve to the
rainfall through the season, showing the changing effect of each
inch of rainfall at different times in the growing period of the plant.
(Note discussion on page 419 and reference 18 at the end of Chapter
23.) This technique has recently been extended to make it possible
to obtain such a differential regression curve for one independent
variable, such as rainfall distribution, while simultaneously making
allowance for the effect of other independent variables, such as
evaporation, and determining the differential regression on the second
independent factor. These methods appear to be particularly valuable in
agronomic and meteorological problems, but they may also be found of
value in other applications. They are fully presented and discussed
in the paper by Davis and Pallesen, listed at the end of this chapter.

Summary. Simple curvilinear regressions, determined by the
successive approximation process, may be subjected to a final
correction by mathematical means; or mathematical curves for each
function may be fitted simultaneously; or certain types of joint relations
may be represented by the use of a composite function, $\theta$, which may
be determined rather readily.

The smoothing of two-variable joint functions may be facilitated 
by the use of contours (the Waugh method) drawn freehand either 
from the subgroup averages, or, in the case of high correlation, from 
the original observations. Other methods employ combination vari- 
ables composed of simple linear functions of two or more independent 
variables to rotate or warp the joint surface and so determine its shape 
other than at right angles to the axes of the independent variables. 
By using several such combination variables, and determining regres- 
sion curves for them by successive approximations, we may represent 
very complex joint functional surfaces quite closely. These methods 
may be extended to joint functions of n variables, but they become 
increasingly complex and require an increasingly large number of 
observations. Even so, these methods reduce somewhat the element 
of human judgment involved in the determination of joint functions 
and simplify the steps involved to more nearly a routine process which 
can be expected to give identical results from the same data in the 
hands of different investigators. 

It is possible to obtain standard errors of estimate and indexes
of multiple correlation, which serve the same purpose for
mathematically determined curvilinear multiple regression equations that
the comparable coefficients serve for linear multiple regressions. Owing
to the larger number of constants to be determined, it is even more
important than it is with linear multiple correlation to adjust the
several measures with respect to the number of observations and
number of constants involved if we are to obtain unbiased estimates
of the corresponding values in the universe from which the sample
was drawn.


REFERENCES 

Bruce, Donald. On possible modifications in the Ezekiel method of curvilinear 
multiple correlation. Typewritten manuscript, filed in the Library, Bureau of 
Agricultural Economics, U. S. Dept. of Agr., 19 pp. 

Waugh, Frederick V. The use of isotropic lines in determining regression sur- 
faces. Jour. Amer. Stat. Assoc., p. 144, June, 1929. 

Court, Andrew T. Measuring joint causation. Jour. Amer. Stat. Assoc., Vol. 
XXV, pp. 245-254, September, 1930. 

Davis, Floyd E., and Pallesen, J. E. Effect of the amount and distribution of 
rainfall and evaporation during the growing season on yields of corn and spring 
wheat. Jour. Agr. Research, Vol. 60, No. 1, pp. 1-24, Washington. Jan. 1, 1940. 



CHAPTER 23 


TYPES OF PROBLEMS TO WHICH CORRELATION 
ANALYSIS HAS BEEN APPLIED 

In the preceding chapters many different practical problems have 
been used to illustrate the kinds of correlation analysis and the 
actual steps in working out the results. It may now be worth while 
to turn attention to specific research problems to which these methods 
have been applied in the past. This will indicate the type of logical 
analysis which must be made before the statistical technique can be 
applied and show something of the kind of conclusions which may 
be reached by the use of these techniques. 

Land values. One of the first comprehensive studies involving 
extensive correlation analysis was a study of land values by Haas (1).¹ 
In this study the sales prices of a number of different farms were 
obtained, and also supplementary facts about the farms, such as dis- 
tance from town, value of buildings, proportion of crop land, fertility 
of the soil, and type of road on which the farm fronted. Changes in 
land values over the period were first eliminated, and the adjusted acre 
prices related to the other factors by linear correlation. Sortings of the 
residual values were used to determine the regressions for some of the 
less important factors. It was found that the differences in value per 
acre had a multiple correlation of R = 0.81 with the factors mentioned 
and that acre values could be estimated from the independent factors 
with a standard error of $19 per acre. As the assessor’s valuations of 
these same farms showed a much larger error, as compared with 
the actual sales values, it was suggested that the impartial regression 
equation be substituted for the less reliable human judgment in 
assessing individual farms for taxation purposes. 

In a later study of the same type (2) the value of the farm dwelling 
and the value of the barns were considered as separate variables, and 
curvilinear regressions were determined. It was found in this study 
that the contribution of the farm dwelling to the farm value was a joint 
function of the value of the dwelling and the size of the farm, an 
expensive dwelling adding more to the value of a large farm than 
to the value of a small one. Road type was one of the factors con- 
sidered. Three classes of roads were used, and the method explained 
in Chapter 17 was employed to determine the net difference in farm 
value per acre with differences in the type of road. Preliminary work 
in this study, with the farm value stated on a per-acre basis, gave a 
linear correlation of R = 0.98. It was discovered, however, that this 
high correlation was due almost entirely to the presence of a few 
very small farms, which showed values of farms per acre and values 
of buildings per acre both running into the thousands of dollars. 
When these farms were excluded, the linear multiple correlation 
dropped to R = 0.64, indicating the spurious correlation obtained by 
dividing by the common factor, number of acres. In the final correla- 
tion, with curvilinear relations and joint functions being used, a 
multiple correlation of P = 0.77 was obtained. As 368 observations 
were used, more complex methods could be employed for this analysis 
than would be feasible in most cases. 

¹ The numbers in parentheses refer to references at the end of this chapter. 
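
The effect of dividing two quantities by a common factor, noted above for the per-acre farm values, can be reproduced with artificial figures. The sketch below is not the land-value data: it assumes total farm value, total building value, and acreage are drawn independently of one another, so that any correlation between the per-acre figures comes only from the shared divisor. All names and ranges are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain product-moment correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ratio_demo(n_farms=2000, seed=0, min_large_acres=20.0):
    """Correlate two unrelated totals before and after dividing by acres."""
    rng = random.Random(seed)
    # Artificial, mutually independent figures (an assumption of this sketch).
    acres = [rng.uniform(1.0, 100.0) for _ in range(n_farms)]
    farm_value = [rng.uniform(50.0, 150.0) for _ in range(n_farms)]
    bldg_value = [rng.uniform(50.0, 150.0) for _ in range(n_farms)]

    farm_per_acre = [v / a for v, a in zip(farm_value, acres)]
    bldg_per_acre = [v / a for v, a in zip(bldg_value, acres)]

    # Repeat the per-acre comparison with the very small farms left out.
    large = [i for i, a in enumerate(acres) if a >= min_large_acres]
    return {
        "totals": pearson_r(farm_value, bldg_value),
        "per_acre": pearson_r(farm_per_acre, bldg_per_acre),
        "per_acre_large_farms": pearson_r(
            [farm_per_acre[i] for i in large],
            [bldg_per_acre[i] for i in large],
        ),
    }


if __name__ == "__main__":
    for name, r in ratio_demo().items():
        print(f"{name}: r = {r:.2f}")
```

In this artificial case the totals are nearly uncorrelated, the per-acre figures are highly correlated, and the correlation falls when the smallest farms are excluded, which is the same pattern the preliminary work showed.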

Physical relations between input and output. Another type of 
problem to which multiple correlation has been applied is determining 
the physical relation between the number of input (or cost) elements 
applied in some production process and the resulting output or yield. 
This problem is particularly important in agricultural research, where 
many of the combinations of conditions which occur in practical 
farming cannot be reproduced or studied under experimental condi- 
tions, and where the number of variables is so great as to make the 
use of fully controlled experiments both lengthy and costly. 

In one of these studies (3) the gain in weight of beef steers on feed 
was related to the quantities of corn, hay, and high-protein feeds fed 
per day, to the number of days on feed, and to the initial weight of the 
animals. Curvilinear regressions were determined for all factors, pro- 
tein being the only one to indicate a true linear relation. The curves 
showed marked diminishing returns per unit of feed as added amounts 
of corn or of hay were fed per day. The younger the animals, and the 
shorter the time they were on feed, the smaller the amount of feed 
necessary to produce a given amount of gain. There were 67 observa- 
tions for this study, each representing a different bunch of cattle. The 
multiple correlation was P = 0.78.² 

² This study is of particular interest to the author, as it was in studying this 
particular problem that the successive approximation method of determining net 
regression curves was first worked out, and in this problem that regression curves 
were first determined by this method while holding the influence of other variables 
constant. 


The same analysis of physical relations has been applied to the 
production of milk by dairy cows. In most of these studies (4, 5, 
6, 7) the feeds used and milk produced have been worked out on a 
herd-average basis, the record for each herd constituting one obser- 
vation. In one study, however, the records were available by indi- 
vidual cows, and the conclusions secured from those records agreed 
quite well with those obtained from the other analyses (8). 

The total quantity of digestible nutrients in the feed, the propor- 
tion of protein in the feed, the proportion of butterfat in the milk, 
and the proportion of the herd freshening in the fall, all have been 
found to be important variables influencing the production of milk. 
Variables of less importance, but of some effect in some localities, 
have been the proportion of nutrients derived from silage, the pro- 
portion of feed fed while on pasture during the summer season, the 
age and weight of the cows, and their quality as indicated by their 
value per head. The breed of cow was considered in several studies, 
but was found to have only a negligible influence on production after 
other factors were allowed for. In spite of the fact that no measures 
have been found satisfactory for the nutrients the cows obtain from 
pastures, multiple correlations ranging up to 0.90 have been obtained 
in these studies, indicating how much the average production of a 
herd is dependent upon the physical conditions and practices. 

Similar correlation studies of the influence of physical input upon 
output have been made in the case of potatoes (9, 10), cotton (11), and 
other crops. In the study of potatoes, yield was found to vary with 
the amount of seed used, the quantity of manure and fertilizer applied, 
and the depth of plowing. The regression for the latter factor was 
particularly interesting, in that it was convex from above, indicating 
that maximum yields were secured with a certain depth of plowing 
and that plowing either deeper or shallower decreased the yield. 

In the study of cotton, the quantities of mixed fertilizer and of 
nitrate of soda used were considered as separate variables; the quan- 
tity of calcium arsenate applied was considered, and also the fertility 
of the land as indicated by the yield of other crops, notably corn. 
The results in this problem raise two interesting points which illus- 
trate some of the logical problems which come up in correlation 
analyses. The arsenate influences the yield through killing the boll 
weevil. In the year studied there was a heavy weevil damage on 
untreated fields, and the applications of poison increased the yield 
very materially. But these results indicate nothing of how much 
influence poison would have on yield in years when weevil damage 
was lighter. It would be necessary to repeat the study over several 
years with varying weevil damage, and then relate the differences in 
the effectiveness of poison to differences in the climatic factors which 
affect the weevil infestation, before it would be possible to judge in 
any particular year whether or not it would pay to use poison that 
year — and the prices both of poison and of cotton would enter into 
the final consideration. 

The inclusion of yields of other crops as a factor in the multiple 
correlation raises another interesting logical point. The net regres- 
sions show that, with other factors remaining the same, farms with 
high yields of other crops also tend to have high yields of cotton. 
This might be interpreted as indicating that farms with high yields 
of other crops also have high native fertility, and that in eliminating 
this factor the results as to the effect of using the other factors have 
been made more dependable. But it may be that the high yields of 
other crops are partly due to high fertilization, either during the 
same year or in previous years. In eliminating the increased cotton 
yield associated with high yields of other crops, then, we might really 
be eliminating part of the result of high fertilization of cotton. The 
simultaneous determination of the relations by the method of multiple 
correlation tends to allow for these inter-relations, if they exist, but 
whether it does so completely in any given case may still be queried. 
In the particular case cited, collection of additional information as 
to fertilizer applied on each field in previous years would give a 
more positive answer to the question of how much was native fertility 
and how much was the result of previous treatment, and so give a real 
solution to the logical dilemma. 

Weather conditions and crop yields. Another type of complex 
physical relationships which has been satisfactorily treated by mul- 
tiple correlation is the relation of weather factors to crop yields. 
The yield problem of Chapter 14, taken from the work of Misner (12), 
and the potato-yield problem of Chapter 21, from the work of Waugh 
(13), have already been discussed at length. Other problems of the 
same sort were early studies of the relation of rainfall in July and 
August to the size of the Illinois corn crop (14); studies of the influ- 
ence of rainfall and temperature during the growing season on cotton 
yields (15); studies of the influence of precipitation, temperature, 
and relative humidity on spring-wheat yields (16); and many others 
which might be mentioned. One interesting study related the weather 
during the winter to the yield of cotton in the South. This study 
showed that extreme cold tended to exterminate the boll weevil 
and so increase the yield of cotton (17). In spite of the fact that 
the correlation was practically perfect during the brief period of six 
years for which the study was made, the author did not believe that 
he had explained all the causes of variations in the yield of cotton, 
and modestly refrained from concluding that he had a perfect forecaster 
of cotton yields. This was fortunate, as, after giving an excellent 
forecast of yield for one year, cotton yields the second year were 
diametrically opposite to the expected yields, with a departure of 
many times the standard error of the previous years. This case is 
interesting as indicating the limited meaning of computed standard 
errors in time series, and as further indicating that a result which 
is not sensible logically cannot be trusted as the sole basis for fore- 
casting, no matter how high the correlation. As it was this same 
investigator who had previously worked out the influence of weather 
conditions during the growing season on the yield of cotton in indi- 
vidual states, he was forewarned as to the significance of his appar- 
ently perfect forecaster, and he was duly cautious in interpreting 
its meaning. The result was certainly important, however, as indi- 
cating that weather factors prior to planting time may be related to 
subsequent yield. 

A somewhat different approach to the crop-yield problem has been 
taken by R. A. Fisher (18). In studying wheat yields at Rotham- 
sted, he pointed out that it really made little difference to the growth 
of the crop whether a given rain occurred on April 30 or May 1; 
yet if the rainfall were studied by monthly totals the assumed effect 
might be quite different. Furthermore, if weekly periods were con- 
sidered for all the different factors, the number of different constants 
in the regression equation might readily exceed the number of obser- 
vations. He therefore devised a method of determining the differ- 
ential relation of rainfall and yield, so as to determine the rate of 
change in yield with the rate of change in rainfall at any season of 
the year. The differential equation required a sufficiently small num- 
ber of constants so that it could be accurately determined from the 
observations at hand. The resulting smooth curve for the change in 
yield with changes in rainfall showed that the maximum effect was 
in fall and in spring, with less effect during the winter. With rain- 
fall through the year the only weather element considered, correla- 
tions ranging from 0.32 to 0.63 were obtained for various test plots. 
Although this method does not take into account joint effects of climate 
at different seasons (as did the potato-yield problem used in Chap- 
ter 22), and the method of analysis is more complicated mathematically 
than any of those presented in this book, the suggestion of determin- 
ing differential regression equations may open up new possibilities of 
accurate and complete analysis. (See page 413.) 
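
The device of keeping the number of constants small can be illustrated with a simplified sketch. A season of weekly rainfall figures is condensed into a few totals, each weighting the weeks by a power of time, and yield would then be regressed on those few totals instead of on every week separately. Plain powers of a rescaled time variable are an assumption made here for brevity (Fisher worked with orthogonal polynomials), and the function name is hypothetical.

```python
def rainfall_terms(weekly_rain, degree=2):
    """Condense a season of weekly rainfall into degree + 1 weighted totals.

    Week i is placed at time t_i on an evenly spaced scale from -1 to +1, and
    term k is the sum over weeks of rainfall_i * t_i ** k. Term 0 is therefore
    the season total; higher terms describe how the rain was distributed.
    """
    n_weeks = len(weekly_rain)
    if n_weeks < 2:
        raise ValueError("need at least two weekly figures")
    times = [-1.0 + 2.0 * i / (n_weeks - 1) for i in range(n_weeks)]
    return [
        sum(rain * t ** k for rain, t in zip(weekly_rain, times))
        for k in range(degree + 1)
    ]


if __name__ == "__main__":
    # Hypothetical season of 52 weekly totals, reduced to six constants.
    season = [0.5 + 0.01 * week for week in range(52)]
    print(len(rainfall_terms(season, degree=5)))
```

With 52 weekly figures and a fifth-degree curve, each year contributes six independent variables rather than 52, so a regression on a few dozen years of yields remains determinate.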

Relation of physical characteristics of samples to chemical char- 
acteristics. A quite different application of correlation analysis has 
been in determining the extent to which the chemical properties of 
a given sample were related to, or could be estimated from, observ- 
able physical properties. The estimation of the protein content of 
wheat from the proportion of vitreous kernels, used as an illustration 
in Chapter 6, is taken from a much more comprehensive series of 
studies (19) in which the weight, the percentage of vitreous kernels, 
and the region of the country from which the wheat came were all 
found to have significant influences. In addition, it was found that 
the relations changed slightly from year to year, so that further work 
remains to be done to determine the influence of differences in climatic 
factors on the relation between physical and chemical properties. 

A somewhat different study, but also within the same general field 
as the last one mentioned, related the volume of bread a given quan- 
tity of flour would produce to the gluten content of the wheat and 
of the flour (20). Correlation was also used to determine the extent 
to which the digestible composition of different cuts of meat could 
be judged from the visible proportion of fat (21). Studies such as 
these illustrate how statistical methods may be used to generalize from 
the results of many tests, even where the tests themselves were car- 
ried out under the carefully controlled conditions of exact scientific 
experiment. 

A somewhat different application of statistical methods to the 
interpretation of data secured from exact scientific measurements is 
in the astronomical problem of the relation of the brightness, inten- 
sity, and distance of the stars. Careful investigations in this field 
(22) have leaned heavily upon correlation analysis for their final 
conclusions. 

These last two types of problems deal with purely physical rela- 
tions, which remain the same, or at worst change only gradually, 
over a series of years. The idea of a statistical universe which is 
being sampled may therefore have some application, though it is 
sometimes a limited one. But in the next type of problem, though 
the universe is stable at any one time, it may change radically from 
year to year, so that conclusions for one year may not be at all ap- 
plicable to those of succeeding years. 


Relation of farm organization to farm income. The question 
of what organization will produce the best returns for the farms in 
a given locality is one that has been given extensive statistical investi- 
gation, by correlation means and otherwise. In particular, studies 
of farm income in Pennsylvania (23), Iowa (24), and Virginia (25), to 
mention specific cases, have made extensive use of multiple correla- 
tion analysis. Such factors have been considered as the size of the 
farm, the acreage in each of the principal crops, the size of the 
important livestock enterprises, the efficiency of crop production and 
livestock production, and the capital invested. In general it has 
been found that about half the variation in earnings from farm to 
farm in the same year can be explained by such objective measures 
of their organization and management as just mentioned. The mul- 
tiple correlations with income range up to a maximum of about 0.75 
to 0.80. In addition, it has been found that the size of the dominant 
enterprise and the efficiency with which it is conducted are usually 
the most important factors affecting returns. Thus on Iowa hog 
farms (24) the yield of corn, the number of brood sows, and the 
efficiency of hog production are dominant factors; on Virginia tobacco 
farms (25), the acreage in tobacco, the yield of tobacco per acre, and 
the quality of tobacco; and on Pennsylvania dairy farms (23), the 
number of dairy cows and the efficiency of the dairy enterprise. 

Beyond these broad generalizations, however, the results of de- 
tailed statistical studies of this type are distinctly limited. In the 
first place, the results hold true only for the particular year in which 
the records were collected. Differences in yields from one year to 
another and changes in the prices of each product and of each cost fac- 
tor modify both the physical and the economic situation, so that many 
allowances must be made before the results can be applied in another 
year. Even if satisfactory adjustments can be made, there is still 
another limitation. Each individual farm is a different entity, and 
the organization which produces the best results on the average will 
not necessarily be the best for any one individual farm. If it were 
possible to observe one farm under one hundred different types of 
organization and operation, and record the resulting profit secured 
under each one, it would then be possible to judge from the analysis of 
those records what type of organization would yield the maximum re- 
turns for that farm under the same price conditions. But with the 
records of different farms representing not like entities but entities 
more or less unlike, the conclusions are not so applicable. Only on the 
assumption that the observations are drawn from a homogeneous uni- 
verse of similar conditions can the results of statistical studies of farm 
organization be interpreted to give the best organization for any one 
farm — and the areas where this assumption is justified are probably 
very few. 

Relation of economic conditions to market price for a commodity. 
All the problems discussed to this point have been such that a cer- 
tain universe might be specified, even though that universe would 
be likely to change more or less with the passage of time. The 
problem of prices, though, is of an entirely different character, for 
there only a single observation can be drawn for any given length 
of period, and the next period is essentially in a different universe. 
Even so, however, there is enough continuity to the way that indi- 
vidual persons react in the aggregate, and enough similarity between 
successive years, so that fairly stable results can sometimes be se- 
cured, and, where the change in reaction is continuous and progressive, 
that change itself can be made one variable in the analysis. 

Annual prices. The simplest price studies are those which relate 
the market price for a commodity to the supply for a marketing 
year. The early work on this line by Moore (26) indicated the gen- 
eral relation of supply to price for corn, hay, oats, potatoes, and 
cotton. The influence of changing conditions was eliminated mainly 
by the use of first differences, so the resulting curves were not sus- 
ceptible of logical economic interpretation. More recent work on 
potatoes (27, 28, 29), oats (30), and cotton (31, 32) has recognized 
the influence of price levels, trends in demand, carryover from pre- 
vious years, and the prices of competing products as factors influ- 
encing price along with supply; and multiple correlation or alterna- 
tive techniques are used to take into account the influence of the 
different variables. With relatively short periods on which to base 
the analyses, exceedingly high correlations have been secured in many 
cases — frequently above 0.95, even after adjusting for the number of 
constants. When forecasts have been made ahead, however, they 
have met with variable success, the forecasts in some years working 
out practically as well as in the period on which the analysis was 
based and in other cases missing by wide margins, sometimes by many 
times the standard error of estimate. 

These extreme errors in forecasting seem to be due to the element 
of fortuitousness in economic events. Thus a factor which has been 
fairly constant for a number of years, and hence has shown little 
influence on price, may suddenly become very important — and upset 
a forecast based on the years in which it was unimportant. In addi- 
tion to such universally upsetting changes as the outbreak of the 
World War, other illustrations of such sporadic and unforecasted 
events are the sudden decrease in the foreign demand for American 
hog products in the spring and summer of 1927 and the increasing 
competition of Indian cotton with American cotton in 1928 and 1929.³ 

Monthly prices. When monthly prices are considered, more elabo- 
rate statistical studies have been possible, with a larger number of 
individual observations. Although questions may be raised as to how 
closely the successive monthly prices of a staple commodity are 
really independent of each other, there is no question but that condi- 
tions are constantly changing, so that there are some elements of 
independence between successive observations. Of the earlier studies of 
monthly prices, a study of cotton prices by Smith (33) is of particular 
economic interest for its separation of the influence of actual and of 
prospective supply on price, and of the shifting of the regression 
curves for these factors through the season — determined as joint 
functions of the month and of the variable. A study of hog prices (34) 
developed both an empirical forecaster of prices and an economic 
interpretation of the influence of market receipts, storage stocks, com- 
peting products, and business conditions, on prices. The correlations 
were relatively low, however, and subsequent analyses have materially 
modified many of the conclusions. The monthly forecasts of hog prices 
based on this study were not as accurate as were forecasts which took 
a broader range of elements into consideration (35). Studies of monthly 
hog prices in Germany by Hanau (36) along the same line yielded rea- 
sonable results and gave forecasts which worked well in practice. A 
study of monthly prices of dressed lamb (37) which gave a correlation 
of 0.98 for the seventeen years studied is noteworthy in that the same 
formula served to estimate monthly prices (from current supply data) 
for three years afterwards, with almost the same accuracy as during 
the period studied. This analysis considered monthly per capita 
supplies, price level, competing products, and business activity as in- 
dependent factors, and determined trend and seasonal variation while 
simultaneously eliminating the influence of other factors. 

The studies mentioned include only a small portion of the statis- 
tical studies of price which have been made, but indicate some of the 
many ways in which statistical analysis has been used in this field, 
and also some of the dangers and pitfalls that beset the investigator. 
Price analysis is the last place to apply statistical methods without 
thorough logical and economic analysis of the particular problem. 

³ During the great economic depression after 1929, on the contrary, many price- 
analysis correlations continued to give fairly reliable forecasts, despite the great 
increase in the amplitude of fluctuation in industrial activity and consumer buying 
power. 

Weekly or daily prices. For very perishable products, where sup- 
plies and prices may fluctuate widely even from day to day, price 
studies may deal with the average prices for a week or even for an 
individual day. Representative statistical studies of this type are 
those of watermelons by Hedden and Cherniack (38) and of peaches 
by Kantor (39). In both these studies it was necessary to take 
into account a regular variation in demand from day to day of the 
week. Other factors influencing demand were also considered, and 
it was found that temperature had a marked influence on the price 
that would be paid for a given supply of watermelons. These short- 
period studies both related to an individual large market — New York 
City. Similar studies have been made for other markets and other 
products. 

Relation of characteristics of different lots of a commodity to 
prices at which they sell. All the price studies which have just been 
discussed treated the reasons for the change in prices from time to 
time, for lots of the commodity of uniform or of average quality, and 
at the same stage of the marketing process. As has been pointed out, 
only one observation can be drawn from each successive universe. A 
type of study which presents different statistical problems is that of 
determining why different lots of the same commodity, sold within a 
given period and at the same stage of the marketing process, should 
sell for different prices. In this case there is a true universe — all the 
sales of the specified kind taking place within the specified period 
— and as large a sample as is desired can be secured, up to the limits 
of the universe. The studies of land prices previously mentioned 
are one example of this type of analysis; and the study of the rela- 
tion of the price of apples to size, insect injury, and scab, used as 
an illustration in Chapter 23, is another example. 

One of the most interesting studies of this type related the prices 
of different lots of asparagus to the length of green color in the stalk, 
the number of stalks in the bunch, and the uniformity of the stalks 
(40). The results of this study, presented very effectively in pic- 
torial style (as reproduced in Fig. 75), have had a marked influence 
on the practices by the producers who supply the Boston market 
and have led to further experimental investigation as to how to pro- 
duce asparagus with the desirable qualities (41). Similar studies 
have been made of the influence of size, color, interior quality, and 



CHARACTERISTICS OF DIFFERENT LOTS AND PRICES 425 


type of carton on the prices received for eggs sold at retail, for both 
the New York and Philadelphia markets (42) and for the Wilming- 
ton market (43). 

One logical point which cannot be overlooked in studies of the 
effect of quality on price, however, is that the premiums paid for 
high-quality lots may vary from time to time with differences in the 
relative supply of products of the different qualities. That is to say, 
though the conclusions as to the effect of quality upon price do apply 
in the universe from which the observations were drawn (with cer- 
tain conditions as to the supply of the different sizes and qualities), 
they may not apply in a different universe in which the circumstances 
have changed. Other studies have therefore attempted to determine 
not only how the prices vary for different qualities under the set of 
supply conditions at one particular time but also how the premium 
or discounts varied from time to time with differences in the supplies 
of each quality. Thus studies of the influence of protein content, 
weight per bushel, dockage, and grade on the prices received for dif- 
ferent cars of wheat (44) have shown that in crop years when high- 
protein wheat is very scarce, a wheat of high protein commands a 
marked premium, and that factor is much more important than 
weight; whereas, in years when high-protein wheat is more plentiful 
but much of the wheat is underweight, the weight factor becomes 
relatively more important, with the protein premium becoming of 
much less significance. With records of more than a thousand cars 
per year for several years, the changes in premiums were determined 
from month to month, by using two joint functions, one for month and 
protein content and the other for month and weight per bushel (44). 

The effect of varying supplies of different sizes and varieties has 
also been studied in the case of peach prices (39). Here it was found 
that the premium for competing varieties changed with the supply of 
each, a variety which sold at a premium when only a small portion 
of the total supply, selling at a discount when it exceeded a certain 
percentage of the supply. The premium for peaches of large size, 
however, tended to persist in spite of increased supplies, though it was 
reduced somewhat when the proportion of large-sized peaches increased. 

These last two groups of studies illustrate the way in which 
changing universes (in time series) may yet be brought within the 
purview of statistical analysis, and conclusions may be reached which 
will be of value in new sets of circumstances. If the complex of 
conditions changes from time to time because of factors such as dif- 
ferences in supply of different sizes, qualities, or varieties, or recurring 
differences in demand from day to day through the week, or from 
month to month through the year, which factors can be objectively 
taken account of and their influence measured with respect to the 
dependent factor or with respect to the influence of other variables 
on the dependent factor (joint relationships), then the fact that the 
circumstances are changing ceases to be a "bug-a-boo," because the 
reasons for the changes may be determined and allowed for. Just 
how far the conclusions from such analyses will hold under changed 
conditions depends upon how adequately the real causes of the changes 
from time to time have been determined, and how much unaccount- 
able dynamic or evolutionary change there has been and may be. 
But even so, this approach seems the hopeful one in treating the 
baffling problem of changing conditions in time series; and it may 
yet be possible to apply laws of sampling and to make statistical 
forecasts for these cases with the same confidence that they can be 
made for stable universes. 

Other price studies. Other types of price-analysis studies which 
may be mentioned briefly are those of differences in prices between 
different points in space or of different points in the marketing 
process. Correlation analysis has been applied to the first of these 
problems (45) in studying the relative influence of changes in freight 
rates, location of supplies, and price level on the margin between 
potato prices in Minneapolis and New York. Some studies of mar- 
keting costs (46, 47) have indicated the influence of size of creamery, 
distance of haul, and methods of operation on creamery costs and 
hence on prices received by farmers for their cream; but the general 
subject of the relation of prices of the same product at different points 
in the marketing process has not otherwise been investigated, except 
in the most general way (48). 

Another variety of price study is in determining the influence of 
prices on the quantity of a product moved into consumption. In 
making studies of this sort for milk (49), it has been found that 
season of the year, day of the week, holidays, changing food habits, 
and income of the consumer have more influence on consumption 
than do price changes; but after these are eliminated a slight but 
significant change in consumption with change in price may be found. 
With cotton, on the contrary (32), it was found that price alone, with 
an upward trend in demand, almost completely determined the quantity 
consumed throughout the world; whereas the quantity consumed in 
the United States was also influenced by the general level of industrial 
activity (50). A similar relation for consumption of hog products to 
price was shown as an illustration in Chapter 6, Table 27. A parallel 
type of study indicates the effect of price on the quantity of cotton 
carried over at the end of the season or withheld or used by the pro- 
ducers. Thus it has been found that in years of low potato prices 
producers feed or waste much larger quantities, whereas when the 
prices fall below certain points much of the supply is left in the ground 
undug (51).

In all these price studies it must be recognized that logically
price does not of and by itself determine consumption, carryover, and
waste, nor does supply alone determine price. Instead where com- 
petition is effective there is a continuous dynamic balance of all the 
factors, which has been aptly described by the great economist 
Alfred Marshall as the closing of a pair of shears, where neither 
blade alone does the cutting. When, however, the relations have been 
analyzed step by step in the various ways which have been described, 
the different relations may then be pieced together in a harmonious 
whole which is logically consistent and which gives concrete state- 
ment to the economic hypotheses concerned (52). 

Relation of changes in production to prices and other factors. 
Another type of problem in which prices are involved, but only as 
independent factors, is studying the influence of price changes on 
changes in production. The distinctive characteristic of these studies 
is that the prices in one period must be related to production in some 
subsequent period or periods, the length of lag depending on the 
technological length of the production process and on the time it 
takes producers to respond to changes in prices. One of the first 
studies of this type related cotton acreage to prices for the previous 
season (53). Subsequent experience showed, however, that continued 
high prices for two seasons might have a different influence than for 
a single season alone (54). Studies of hogs showed that it took 
eighteen months for differences in prices to be reflected in market 
receipts (34). The price of corn was found of equal importance with
the price of hogs in causing changes in hog production. Hog produc- 
tion has also been studied by different type-of-farming areas, and this 
detailed study has shown marked differences in the responses to prices 
in different areas, depending on the position the hog enterprise occupied
in the farming system. The weather conditions during the farrowing
season in the spring and the relation of corn prices to hog
prices during several critical periods in the production process were
found to be important factors (55). The production of milk has 
likewise been found to respond to changes in the relation of the price 
of the milk to the costs of feedstuff (56). There is a short-time 
effect which is due to changes in the intensity of feeding and a long- 
time effect which is due to changes in the number of cows (57). The 
acreage of potatoes reflects prices for two years preceding, as well 
as prices for the year before. The responses for potatoes are quite
parallel in different areas, though there are some important dif- 
ferences reflecting differences in the position the enterprise occupies 
in the farming system (54). In the case of some minor crops, the
prices for the major crop of the region have as much influence on
subsequent acreages of these competing crops as do the prices of the 
minor crops themselves. Thus sweet-potato acreage is influenced by 
cotton prices, and flax acreage by wheat prices. In other cases, 
yields or per-acre returns for preceding years must be considered, as 
well as prices alone. The general price level of competitive products 
or of all commodities has also usually been considered in judging the 
significance of a particular price. 

In most of these studies of production responses, it has been found 
necessary to state the subsequent acreage or production as a per- 
centage of, or as an absolute increase or decrease from, the acreage 
or production of the preceding year or production period. Stating 
the relation in this way recognizes the fact that the farmer or other 
producer must plan the next year's operations not with reference to
any hypothetical normal or average but with reference to the actual 
production situation of the current year. Often, as is illustrated in 
Figure 76, a very high price will not call forth any larger increase 
in production in the following year than will a moderately high 
price, owing to the inability of the producer to expand his operations 
more than a certain extent in any one year. In this respect this 
type of price study is quite distinct from other types, for in many 
studies of the response of prices to supplies more satisfactory results 
have been secured by working with the absolute figures rather than 
with changes from year to year. 
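The statement of production as a percentage of the preceding year, related to the price of that preceding year, can be sketched as follows. This is only a minimal illustration: the price and acreage figures are invented, and the function names `pct_change_vs_lagged_price` and `pearson_r` are hypothetical helpers, not taken from any of the studies cited.

```python
# Hypothetical illustration: state each year's acreage as a percentage of
# the preceding year's acreage and pair it with the preceding year's price.
# All figures below are invented for the sketch.

def pct_change_vs_lagged_price(prices, acreages):
    """Return (price in year t-1, acreage in year t as % of year t-1) pairs."""
    pairs = []
    for t in range(1, len(acreages)):
        lagged_price = prices[t - 1]
        acreage_pct = 100.0 * acreages[t] / acreages[t - 1]
        pairs.append((lagged_price, acreage_pct))
    return pairs


def pearson_r(xs, ys):
    """Simple product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


prices = [80, 120, 95, 140, 70, 110]      # hypothetical price index by year
acreages = [100, 92, 104, 101, 115, 98]   # hypothetical acreage by year

pairs = pct_change_vs_lagged_price(prices, acreages)
r = pearson_r([p for p, _ in pairs], [a for _, a in pairs])
print(pairs)
print(round(r, 3))
```

Because each observation is measured from the actual situation of the year before, the first year supplies only a price and the last year only an acreage, so six years of data yield five paired observations.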

Miscellaneous agricultural problems. Another group of studies 
has investigated the relation of the physical characteristics of plants 
or animals to their ability to produce. Studies of dairy cows, by
Gowen (58), restricted to simple two-variable correlations, have indicated
that most of the factors in the physical conformation of dairy
cows have little or no relation to productive ability. Studies of the
relation of the size and shape of corn kernels, ears, and plants to
weight of the grain (59) and multiple correlations by Richey which
took the actual yielding ability as the criterion (60) have led largely
to the same result. These studies indicate that many of the time-honored
points which have been stressed in agricultural show competitions
and in breeding selection have no utilitarian significance
and have led to a new stress on performance records rather than
physical appearance as the ultimate test.

[Figure: Relation between price and subsequent changes in acreage and
number of hogs.]

Fig. 76. Changes in acreage or production with changes in prices received, for
different agricultural products. (From reference 54, by Louis H. Bean.)

Correlation in psychology and education. Correlation and multiple
correlation methods have been widely applied in educational
and psychological investigations to the study of such problems as the
relation of grades in one subject to grades in another (61), or the
scores on one mental test to scores on another (62), or the relation
of scores on mental tests to success in the schoolroom (63) or in later
life (64). Studies have also been made of the relation of mental
and physical characteristics to success in different occupations, such 
as the relation of the relative success of individual farmers to their 
training, schooling, initiative, business ability, etc. (65). This latter 
study, which indicated that approximately half the differences in 
farmers' financial success could be accounted for on the basis of 
individual differences in the men, has a tantalizing tie-up with the 
studies of farm management, which show that roughly half the dif- 
ferences in income can be explained by the way the farms are run. 
Apparently, by considering both the characteristics of the farmer and 
the way the farm is organized and run, it would be possible to account 
for all the differences in income. But if the men with superior mental 
ability are the men whose farms were organized and run in a superior 
manner, the ratings of the farmers and of their farming methods 
would be merely overlapping measures of the same thing. 
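The point about overlapping measures can be made concrete with the standard formula for the squared multiple correlation of a dependent variable on two independent variables. The sketch below assumes, purely for illustration, that each factor taken alone accounts for exactly half the variance in income; the function name and the intercorrelation of 0.8 are hypothetical and are not results from the studies cited.

```python
# Squared multiple correlation of y on two independent variables, from the
# three simple correlations. The numerical inputs are illustrative only.

def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R^2 = (r_y1^2 + r_y2^2 - 2*r_y1*r_y2*r_12) / (1 - r_12^2)."""
    if abs(r_12) >= 1.0:
        raise ValueError("the two independent variables must not be identical")
    numerator = r_y1 ** 2 + r_y2 ** 2 - 2.0 * r_y1 * r_y2 * r_12
    return numerator / (1.0 - r_12 ** 2)


# Each factor alone is assumed to account for half the variance (r^2 = 0.5).
r_half = 0.5 ** 0.5

# If farmer ratings and farm-organization ratings were uncorrelated,
# the two together would account for all of the variance.
independent = r_squared_two_predictors(r_half, r_half, 0.0)

# If the two ratings are themselves correlated 0.8 (hypothetical),
# together they account for only a little more than either one alone.
overlapping = r_squared_two_predictors(r_half, r_half, 0.8)

print(round(independent, 4), round(overlapping, 4))
```

The closer the intercorrelation of the two ratings comes to unity, the nearer the joint result falls to the one-half obtained from either rating alone.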

In most of the cases in which correlation analysis has been applied 
to psychological problems, it has been used primarily to measure close- 
ness of relationship rather than to obtain a basis for estimating one 
variable from another. In studies of this type even a low correlation 
may be important, so long as it is large enough so as not to be due to 
random fluctuations. Thus one study reached the conclusion that even 
in groups of the same economic and social status, there is a small nega- 
tive correlation between number of children per family and intel- 
ligence (66). The psychologists and the biologists might have a warm 
argument, though, as to which was cause and which was effect! In an- 
other study, in which a given test was repeated, with twice as much 
time to complete it the second time, the scores made on the two trials 
were correlated, and correlations of 0.76 to 0.91 were found. These
correlations were made the basis for concluding that the tests determine
power alone, rather than speed (67). Inasmuch as a correlation of
0.76 means that nearly half the variance in the two factors is not
associated, it might be questioned whether this interpretation is altogether
satisfactory. Here the use of r (= 0.76) instead of d (= 0.58)
leads to overstressing the significance of the observed correlation.
Many other applications of correlation or partial correlation in psy- 
chological research (68, 69, 70) illustrate the usual tendency to de- 
pend on correlation coefficients, rather than regression equations, as the 
means of expressing relationship. 
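The arithmetic behind the comparison of r and d is brief enough to set out. The sketch assumes that d here is the square of r, which agrees with the two values quoted (0.76 and 0.58); the function name `determination` is a hypothetical helper.

```python
# Proportion of variance associated (d) and not associated (1 - d)
# for the two correlations quoted in the text.

def determination(r):
    """Square of the correlation coefficient."""
    return r * r


for r in (0.76, 0.91):
    d = determination(r)
    print(f"r = {r:.2f}  d = {d:.2f}  not associated = {1 - d:.2f}")
```

For r = 0.76 this gives d of about 0.58, leaving about 0.42 of the variance not associated, which is the "nearly half" referred to.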

Interesting results have also been secured by the application of 
correlation methods to problems on the border line between psychology,
sociology, and political science. Thus in a study of factors
influencing the attitudes of mothers toward sex education, it was
found that a number of measures of previous environment showed
no significant correlation with the mother’s attitude; but that there
was a significant correlation between their opinion and the amount 
of sex education given their children (71). Another interesting study 
on the political-sociological border line determined the intercorrela- 
tions between the quantities of information, misinformation, and 
prejudice possessed by college students, and their grades, and their 
conservative or radical political positions. High prejudice, high mis- 
information, low grades, and conservatism were found to be associated; 
and likewise low prejudice, good grades, low misinformation, and 
radicalism. The correlations were low in all cases, however (72). 

The use of correlation methods in the field of education and 
psychology has been hampered by the fact that in many cases the 
factors dealt with are not tangible facts which can be objectively 
measured but are intangibles which can be only roughly approxi- 
mated by some process such as ranking. If anything approaching 
a normal distribution of the factor considered is assumed, ranking 
tends to make the true difference between successive individuals in 
the series much less in the central portions of the array than in the 
extreme portions. Furthermore, the ranked series is a discrete series,
with the possibility always present that the sixth item in order, for 
example, may exceed the seventh item by 10 times the amount that 
the seventh exceeds the eighth, or vice versa. Both these difficulties 
are apparent in the accompanying set of data. 


Grades Received by 24 Persons Taking an Examination in Statistics

Rank  Grade    Rank  Grade    Rank  Grade    Rank  Grade
  1     99       7     94      13     85      19     81
  2     98       8     91      14     85      20     80
  3     98       9     90      15     85      21     77
  4     98      10     88      16     83      22     75
  5     97      11     87      17     83      23     69
  6     94      12     85      18     82      24     68
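The unevenness of the steps between successive ranks in these grades can be checked directly. The sketch below uses the 24 grades exactly as tabulated; the variable names are illustrative only.

```python
# Grades in rank order (rank 1 first), as given in the table of 24 grades.
grades = [99, 98, 98, 98, 97, 94, 94, 91, 90, 88, 87, 85,
          85, 85, 85, 83, 83, 82, 81, 80, 77, 75, 69, 68]

# Difference in grade between each rank and the next lower rank.
gaps = [upper - lower for upper, lower in zip(grades, grades[1:])]

largest = max(gaps)
rank_of_largest = gaps.index(largest) + 1   # rank of the upper member of the pair

print(gaps)
print("largest step:", largest, "between ranks",
      rank_of_largest, "and", rank_of_largest + 1)
print("steps of zero:", gaps.count(0))
```

A single step in rank thus corresponds to anything from no difference in grade to a difference of 6 points, so equal intervals of rank do not represent equal intervals of the quality being measured.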


The difficulties enumerated have made psychological workers and 
educators feel that the standard Pearsonian methods of correlation 
(those presented in Chapters 4 and 5, and 12 and 13, of this book) 
are not applicable to their data, and have led to the development of 


MORE RECENT APPLICATIONS OF CORRELATION ANALYSIS 433 


various alternative methods, such as the Spearman ^‘foot-rule corre- 
lation’’ for ranked data (73) and other similar short cuts. It is not 
evident that these new measures meet the difihculties enumerated, and 
furthermore they give measures of correlation which differ from the 
Pearsonian coeflicients for the same data. The use of curvilinear 
regressions, as discussed in Chapter 7 and subsequent chapters, partly 
meets the difiSculties in handling such data, since the effect of a vary- 
ing significance of the unit of measurement in different portions of 
the range may result in transforming what would otherwise be linear 
regressions to a non-linear shape. That does not, however, meet the 
difficulties of the discrete or ^^jumpy” quality of ranked values; nor 
does it seem that any other statistical treatment is likely to do so 
completely. 
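That rank measures differ from the Pearsonian coefficient for the same data may be seen from a small invented example. The sketch assumes the foot-rule in the form in which it is commonly stated, R = 1 - 3Σ|d|/(n² - 1), where d is the difference in rank, and it also shows the rank-difference coefficient obtained by applying the product-moment formula to the ranks. The data and function names are hypothetical, and ties are not handled.

```python
# Three measures of relationship computed from the same invented data.

def pearson_r(xs, ys):
    """Product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """Rank 1 for the largest value; assumes no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def footrule(xs, ys):
    """Foot-rule as commonly stated: 1 - 3 * sum|d| / (n^2 - 1)."""
    n = len(xs)
    total = sum(abs(a - b) for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 3.0 * total / (n * n - 1)


x = [1, 2, 3, 4, 5, 6]        # hypothetical scores on one measure
y = [2, 1, 4, 3, 6, 50]       # hypothetical scores on another

r_raw = pearson_r(x, y)                   # Pearsonian, on the raw values
r_rank = pearson_r(ranks(x), ranks(y))    # product-moment formula on ranks
r_foot = footrule(x, y)                   # foot-rule on ranks

print(round(r_raw, 3), round(r_rank, 3), round(r_foot, 3))
```

The three figures differ for the one set of data, and the single extreme value of 50 affects the raw-value coefficient while leaving both rank measures untouched.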

Where the dependent variable is definitely discrete, so that two or 
more categories can be recognized, but no continuous variation can 
be assumed, correlation methods are clearly inapplicable. Special 
statistical measures of association, parallel to the correlation coeffi- 
cient, have been worked out for such problems (74). 
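One widely known measure of association for two categories on each side is the fourfold (phi) coefficient. The text does not say which measures reference (74) covers, so the sketch below is offered only as an assumed example of the general kind; the counts and the function name are hypothetical.

```python
# Fourfold (phi) coefficient for a 2 x 2 table of counts:
#
#                 success   failure
#   trained          a         b
#   untrained        c         d
#
# The counts used below are invented for illustration.

def phi_coefficient(a, b, c, d):
    """phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))."""
    denominator = ((a + b) * (c + d) * (a + c) * (b + d)) ** 0.5
    if denominator == 0:
        raise ValueError("every row and column must contain some cases")
    return (a * d - b * c) / denominator


phi = phi_coefficient(30, 10, 15, 45)
print(round(phi, 3))
```

Like the correlation coefficient, this measure runs from -1 to +1 and is zero when the two classifications are unrelated, but it requires no assumption of continuous variation in either factor.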

No attempt has been made in this book to treat the special cor- 
relation methods developed in educational and psychological work. 
Instead, it has been restricted to the analysis of dependent variables 
which were continuously variable or which could logically be thrown 
into that form. 

Correlation analysis in other fields. The types of problems which 
have been discussed do not begin to exhaust the uses which have 
been made of correlation, simple or multiple, in research work. Since 
they are drawn largely from the author’s own range of interest, 
they are heavily weighted by the agricultural or even the agricultural
economic field. Random examples of correlation work in other fields 
are the use of multiple correlation to obtain a definite formula for 
forecasting pig-iron production (75), to determine the extent to which 
freight rates are influenced by the factors of terminal charges, length 
of haul, expense of operation, and other factors (76), or to determine
how far meat sales in different branch houses are influenced by local 
conditions of demand, and what a reasonable quota might be (77). 

More recent applications of correlation analysis. The discussion 
to this point in this chapter remains substantially unchanged from 
that of the first edition of this book. Since that edition was published, 
there has been a vast expansion of research work in many of these 
fields. In some fields, such as commodity price analysis, an entire 
book would be required merely to discuss subsequent studies (78).
Here we shall simply note briefly some of the additional fields to
which correlation analysis has been applied in the decade since the
first edition was published, without attempting to appraise the sub- 
sequent work in the fields already mentioned. 

Price-making forces for industrial commodities. The same methods 
used earlier with farm-product prices have more recently begun to 
be applied to the explanation of industrial price-making forces. Steel 
(79), automobiles (80), houses (81), and ships (82) illustrate some 
of these studies. In fields where free competition does not prevail, 
but the dominance of a few large concerns produces monopolistic 
competition, the supply and price relations may operate quite dif- 
ferently from the way they operate under fuller competition (83).
In such cases great care is necessary to set up the statistical analyses 
in such terms as to represent the market situation as it really functions 
in the given industry. 

Production functions for industries. The relation of volume of out- 
put to average cost per unit is an important consideration both in 
economic theory and in industrial organization. Recent overall studies 
for certain large concerns (84, 85, 86) have revealed the cost function 
for such products as steel, hosiery, and furniture. In some of these 
studies, multiple correlation was used to measure the influence of 
percentage of capacity operated on total cost or per-unit cost, while 
simultaneously holding constant other factors such as wage rates, 
price levels, or changing labor efficiency. 

Size standards for children’s clothes. Quite a different recent ap- 
plication of correlation technique was made in a study of appropriate 
size standards for children's clothes, conducted by the Bureau of Home 
Economics (87). In this study, all possible bodily dimensions were 
measured for thousands of children all over the country, together with 
their age, sex, and race. Multiple correlation was used to determine 
which of these measurements were most important in judging size as 
a whole. It was found that height and girth at hips were the most 
important. After these were allowed for, age was found to have no 
appreciable relation to the other bodily measurements. A new system 
of clothes sizes, based on the distribution of these two measurements, 
was recommended to clothing manufacturers. By using these sizes 
instead of the conventional age sizes, it will be possible to have ready- 
made clothing which can be bought merely by size, and yet have a 
satisfactory fit for a large proportion of all children. 

Measures of components of intelligence. Certain workers in psychology
and education have modified correlation procedures to investigate
the problem of how many independent factors are involved
in intelligence. Spearman introduced the theory that there was one 
general factor which ran through all intelligence tests, plus various 
specific factors in each test, and made extensive statistical studies to 
substantiate this claim (88). Other students advanced the theory 
that three or four general factors, differently weighted in each case, 
could explain all the different measures of intelligence (89). Although
these investigations have led into involved calculations and highly 
refined mathematics, their actual significance is still in doubt. 

Explanations of political behavior. During the past decade the 
methods of statistical analysis, especially of sampling, have been ex- 
tensively applied to the field of political behavior. The earlier Literary 
Digest Poll, and the more refined and scientific Gallup Poll and 
Roper Poll, have become almost household words. Along with these, 
correlation analysis has been used to show the relations between votes 
by states and national averages, and to develop the predicting re- 
liability of opinions or votes in particular areas (90). Correlation 
methods, including some of the highly involved methods of psy- 
chological studies referred to in the preceding paragraph, have also 
been used in detailed studies of political structure and behavior in 
particular cities or localities (91). 

Tests of correlation results. With the passage of the years it has
been possible to verify some of the earlier studies by applying them to 
later data, or by analyzing data for entire subsequent periods to see 
if they gave comparable results. Some subsequent studies of the 
response of milk production to prices received, however, gave quite 
different results from those given by earlier studies, and led to the 
conclusion that factors which had shown a high correlation with production
while the industry was expanding in a given region failed to
have the same significance after maturity was reached (92). In this 
case the economic growth proved to be irreversible. These studies 
prompted more detailed analysis of the problem and led to the de- 
velopment of more intensive techniques, which consider not only prices 
but also the whole farm-management organization of typical farms 
in reaching conclusions as to the long-run response of production to 
price (93). In a quite different case, the response of milk production 
to variations in feed input is being tested by elaborate feeding ex- 
periments, with the resulting data subjected to thorough statistical 
analyses (94). The preliminary results from these analyses show a 
net relation of milk output to feed input which agrees surprisingly 
well with the same relation as determined earlier by multiple correla- 



436 


EXAMPLES OE COREELATION ANALYSIS 


tion analysis (6) from cow-testing association records of actual farm 
experience (95). 

Other applications. Other new applications of correlation methods 
have been made in testing the strength of materials when subjected 
to varying stresses, in determining the effect of various local water 
characteristics on the amount of inside deposit in water or steam pipes 
made of various materials, and in establishing sales quotas or ad- 
vertising allotments for specific products in various districts in the 
light of the industrial and economic characteristics of each district. 
Since these studies were made in private research agencies for the 
benefit of private concerns, the results have usually not been published. 
In some cases the findings are regarded as valuable trade secrets. The 
variety of problems to which correlation, and especially multiple cor- 
relation, has been applied, does, however, indicate the significance 
of this technique as a means of unlocking secrets of relationship in 
many cases where they could be discovered in no other way. 

Many more pages might be filled with the details of studies such 
as those discussed. But probably enough has been presented to illus- 
trate the wide range of problems in which the use of statistical analy- 
sis sheds new light on the relationships present and their significance. 
It may be hoped that these illustrations have developed the necessity 
for careful logical analysis of each problem to which statistical analy- 
sis is to be applied, and have indicated the need both for good theo- 
retical knowledge of the field in which the problem lies and for thorough 
technological knowledge of the elements involved in the particular 
problem. The technological knowledge is particularly important in 
selecting the different factors or in deciding on their statement or 
interpretation. 

No attempt has been made here to list all the significant statis- 
tical studies in any one of the fields discussed, or to evaluate their 
importance. Instead, the studies mentioned have been selected solely 
to illustrate various specific points; in many cases a significant study 
has not been referred to because the point was already covered, or a 
relatively unimportant study has been mentioned because of its per- 
tinence to a particular topic. This discussion should therefore not 
be regarded as a critical evaluation of the work in any of the fields
touched upon. That has been left for experts in each field. Instead, 
the comments are intended solely to develop the variety, complexity, 
and significance of the problems to which statistical analysis may be 
applied and the care and thought which are even more necessary
than the statistical computations, if the results are to be of lasting
value. 


REFERENCES 

1. Haas, G. C. Sale prices as a basis for farm land appraisal. TJniv, Minn. Agr. 

Expt. Sta. Tech. Bui. 9. 1922. 

2. Ezeiviel, Mordecai. Factors affecting fanners’ earnings in Southeastern Penn- 

sylvania. IJ. S. Dept. Agr. Bui. 1400, pp. 39-60. 1926. 

3. Tolley, H. R., J. D. Black, and M. J. B. Ezekiel. Input as related to output 

in farm orgjinization and cost-of-production studies. U. S. Dept. Agr. Bui. 
1277, pp. 7-12. 1924. 

4. Misner, E. G. Relation of the composition of rations on some New York 

dairy farms to the economics of milk production. Cornell Univ. Agr. Expt. 
Sta. Memoir Qi. 1923. 

5. Vernon, J. J., C. W. Hoi.daway, Mordecai Ezekiel, and R. S. Kifer. Factors 

affecting retums from the dairy enterprise in the Shenandoah Valley. Va. 
Agr. Expt. Sta. Bui. 257, pp. 32-42 1927. 

6. Ezekiel, M. J. B., P. E. McNall, and F. B. Morrison. Practices responsible 

for variations in physical requirements and economic costs of milk produc- 
tion on Wisconsin dairy farms. Tffs. Agr. Expt. Sta. Research Bui. 79. 
1927. 

7. Pond, Georob, and Mordecai Ezekiel. A study of some hictors affecting the 

physical and economic ('osts of butterfat producjtion in Pine County, Minn. 
Uuiv. Mifiu. Agr. Expt. Sta. Bui. 270. 1930. 

8. Johnson, Sherman E., J. 0. Tretsvkn, Mordecai Ezekiel, and 0. V. Wells. 

Organization, feeding methods, and other practices affecting returns on irri- 
gated dairy farms in Western Montana. I'niv. Montana Agr. Expt. Sta. 
Bui. 264. 1932. 

9. H.ardenbitrg, E. V. A study, by the crop survey method, of factors influencing 

the yield of potatoes. Cornell Univ. Agr. Expt. Sta. Memoir 57. 1922. 

10. Reference (3) above, pj). 16-18. 

11. Westhrook, E. C., W. a. Minor, Jr., Kenneth Thaynor, C. I.. Goodrich, and 

W. C. Funk. .\n ('cononiit^ study of farm organization in Sumter County. 
Ceorgia Stale College oj Agr. Hal. 324, jip. 82-87. Decc’mlxu’, 1927. 

12. Misner, E. G. Studies of the relation of weatli(M* to the product ion sind price 

of farm |)roducts. 1. (’orn. Cornell Thiiv., mimeographed publication. 
March, 1928. 

13. Waikjii, litEDEuicK V., Chester D. Stevens, and Gustave Buumeistek. 

M(‘thods of forecast ing New England potato yields. U. S. Dc'pt. Agr., Bur. 
Agr. h^icon., miimaigraphed report. F('bruary, 1929. 

11. Moore, Henry L. Economic CgclcN; Their Law and Cu//.st, jip. 35-44. Mnc- 
millMn. 1914. 

15. Smith, BhadI'’ord B. Relation Ix'tween weather conditions and yield of cotton 
ill Bouisiami. Jour. Agr. Res., Vol. XXX, No. 11, pp. 1083-1086. June 1, 
1925. 

1(>. J’A'rroN, Palmer. Relationship of weather to crops in the plains region of 
Montana. Mont . Expt. Sta. Bui. 206. 1927. 



438 EXAMPLES OF CORRELATION ANALYSIS 

17. Smith, Bradford B. The adjustment of agricultural production to demand. 

Jour. Farm. Econ., Vol. VIII, No. 2, pp. 163-165. April, 1926. 

18. Fisher, R. A. The influence of rainfall upon the yield of wheat at Rotham- 

sted. Phil, Tram., B., CCXIII, pp. 89-142. 1924. 

19. Shollenberger, J. H!., and Corinne F. Kyle. Correlation of kernel texture, 

test weight per bushel, and protein content of hard red spring wheat. Jour. 
Agr. Res., Vol. 35, No. 12, pp. 1137-1150. Dec. 15, 1927. 

20. Coleman, D. A., H. B. Dixon, and H. C. Fellows. Comparison of some 

physical and chemical tests for determining the quality of gluten in wheat 
and flour. Jour. Agr. Res., Vol. 34, No. 3, pp. 241-264. Feb. 1, 1927. 

21. Chatpibld, Charlotte. Proximate composition of beef. U. S. Dept. Agr., 

Dept. Circular 389. 1926. 

22. Pettit, Edison. Ultra-violet solar radiation. Proc. Nat. Acad. Sciences, 13. 

p. 380. 1927. 

23. Reference (2) above, pp. 20-25, 54-59. 

24. Taylor, C. C. A statistical analysis of farm management data. Jour. Farm 

Econ., V, pp. 153-162. June, 1923. 

25. Vernon, J. J., and M. J. B. Ezekiel. Causes of profit or loss on Virginia 

tobacco farms. Va. Agr. Expt. Sta. Bui. 241. 1925. 

26. Moore, Henry L. Economic Cycles; Their I^aw and Cause, pp. 63-134. Mac- 

millan. 1914. 

Forecasting the Yield and Price of Cotton. Macmillan. 1917. 

27. Working, Holbrook. Factors determining the price of potatoes in St. Paul 

and Minneapolis. TJniv. Minn. Agr. Expt. Sta. Tech. Bui. 10. 1922. 

28. Factors affecting the price of Minnesota potatoes. Minn. Agr. Expt. Sta. 

Tech. Bui. 29. 1925. 

29. Waugh, Frederick V. Forecasting prices of New Jersey white potatoes and 

sweet potatoes. N. J. State Dept. Agr. Circ. 78. 1924. 

30. Killouqh, Hugh B. What makes the price of oats. XJ. S. Dept. Agr. Bui. 1351, 

1925. 

31. Reference (17) above, pp. 145-153. 

32. Bean, Louis H. Some interrelationships between the supply, price, and con- 

sumption of cotton. U. S. Dept. Agr., Bur. Agr. Econ., mimeographed 
report. April, 1928. 

33. Smith, Bradford B. Factors affecting the price of cotton. U. S. Dept. Agr. 

Tech. Bui. 50. 1928. 

34. Haas, G. C., and Mordecm Ezekiel. Factors affecting the price of hogs. U. S. 

Dept. Agr. Bui. 1440. 1926. 

35. Ezekiel, Mordecai. Two methods of forecasting hog prices. Jour. Amer. 

Stat. Assoc., 22, pp. 22-30. March, 1927. 

36. Hanau, Arthur. Die Prognose der Schweinepreise. Vierteljahrshefte zur 

Konjunkturforschung, Sonderheft 7. Institut fiir Konjunkturforschiing. 
Berlin February, 1928. 

37. Ezekiel, Mordecai. Factors related to lamb prices. Jour. Pol. Econ Vol 

XXXV, No. 2. April, 1927. 

38. Hedden, W. P., and Nathan Chbrniack. Measuring the melon market. Pre- 

liminary (mimeographed) report, U. S. Dept. Agr., Bur. Agr. Econ., in 
cooperation with the Port of N. Y. Authority, August, 1924. 

39. Kantor, Harry. Factors affecting the price of peaches in the New York City 

market. V. S. Dept. Agr. Tech. Bui. 115. 1929. 



REFERENCES 


439 


40. Waugh, Frederick V. Quality as a Determinant of Vegetable Prices,
    pp. 3-45. Columbia Univ. Press. 1929.

41. Diedjens, V. A., W. D. Whitcomb, and R. M. Koon. Asparagus and its
    culture. Mass. Agr. College Extension Leaflet 49. April, 1929.

42. Howe, Charles B. Some local market price characteristics which affect New
    Jersey egg producers; factors influencing the retail prices of eggs. N. J.
    Agr. Expt. Sta. Bul. 1930.

43. Benner, Claude L., and Harry G. Gabriel. Marketing of Delaware eggs.
    Del. Agr. Expt. Sta. Bul. 150. 1927.

44. Kuhrt, W. J. A study of farmer elevator operation in the spring wheat area.
    Series of 1925-26. Part II. Analysis of the variation in the quality factors
    of the 1925 crop of spring wheat, and the relation of such variation to price
    received and premiums paid in 1925-26. U. S. Dept. Agr., Bur. Agr. Econ.,
    preliminary report. October, 1927.

45. Working, Holbrook. Factors influencing price differentials between potato
    markets. Jour. Farm Econ., pp. 377-398. October, 1925.

46. Black, John D., and Edward S. Guthrie. Economic aspects of creamery
    organization. Univ. Minn. Agr. Expt. Sta. Tech. Bul. 26. 1924.

47. Schoenfeld, William A. Some economic aspects of the marketing of milk
    and cream in New England. U. S. Dept. Agr. Circ. 16, pp. 24-29. 1927.

48. Warren, George F., and F. A. Pearson. Interrelationships of supply and price.
    Cornell Univ. Agr. Expt. Sta. Bul. 466. 1928.

49. Ross, H. A. The demand side of the New York milk market. Cornell Univ.
    Agr. Expt. Sta. Bul. 459. 1927.

50. Bean, Louis H. A simplified method of graphic curvilinear correlation. Jour.
    Amer. Stat. Assoc. December, 1929.

51. Demand and supply curves on potatoes and cotton. 1929. Unpublished
    manuscript, on file in Bureau of Agricultural Economics Library.

52. Ezekiel, Mordecai. Statistical analyses and the “laws” of price. Quart. Jour.
    Econ., Vol. XLII, pp. 199-225. February, 1928.

53. Smith, Bradford B. Forecasting the acreage of cotton. Jour. Amer. Stat.
    Assoc., Vol. 20, No. 149, pp. 31-47. 1925.

54. Bean, Louis H. The farmer’s response to price. Jour. Farm Econ., Vol. XI,
    No. 3, pp. 368-385. July, 1929.

55. Elliott, Foster F. Adjusting hog production to market demand. Univ. Ill.
    Agr. Expt. Sta. Bul. 293. 1927.

56. Gans, A. R. Elasticity of supply of milk from Vermont plants. Vt. Agr.
    Expt. Sta. Bul. 269. 1927.

57. Reference (47) above, pp. 34-50.

58. Gowen, John W. Studies on conformation in relation to milk producing
    capacity in cattle. Jour. Dairy Science, Vol. III, No. 1, January, 1920;
    Vol. IV, No. 5, September, 1921.

    Conformation and milk yield in the light of the personal equation of the
    dairy cattle judge. Maine Agr. Expt. Sta. Bul. 314. 1923.

59. Wolfe, T. K. A biometrical analysis of characters of maize and of their
    inheritance. Va. Agr. Expt. Sta. Tech. Bul. 26. 1924.

60. Richey, Frederick D. A statistical study of the relation between seed-ear
    characters and productiveness in corn. U. S. Dept. Agr. Bul. 1321. 1925.

61. Mensenkamp, L. E. Ability classification in ninth-grade algebra. The
    Mathematics Teacher. January, 1929.





62. Garrison, K. C. Correlation between intelligence test scores and success in
    certain rational organization problems. Jour. Applied Psychol.
    December, 1928.

63. Weeks, Angelina L. A vocabulary information test. Archives of Psychol.
    May, 1928.

64. Hull, Clark L. Prediction formulae for teams of aptitude tests. Jour.
    Applied Psychol. Vol. VII, pp. 277-284. 1923.

65. Higbie, Edgar Creighton. An Objective Method for Determining Certain
    Fundamental Principles in Secondary Agricultural Education. Published
    at Madison, Wis., by the author. 1924.

66. Sutherland, H. E. G. The relationship between I.Q. and size of family. Jour.
    Educ. Psychol. February, 1929.

67. Freeman, Frank S. Power and speed, their influence upon intelligence test
    scores. Jour. Applied Psychol. December, 1928.

68. Chauncey, Marlin R. The relation of the home factor to achievement and
    intelligence test scores. Jour. Educ. Res., Vol. XX, No. 2, 88. September,
    1929.

69. Winch, W. H. Accuracy in school children. Does improvement in numerical
    accuracy “transfer”? Jour. Educ. Psychol., 1, 557-589. 1910.

70. Goodenough, F. L. The Kuhlman-Binet tests for children of pre-school age.
    Univ. Minn. Institute of Child Welfare, Mon. Series 2. 1928.

71. Witmer, Helen Leland. Attitudes of Mothers Toward Sex Education. Univ.
    Minn. Press. 1928.

72. Allport, Gordon W. The composition of political attitudes. Amer. Jour.
    Sociology, Vol. XXXV, 2, pp. 220-238. September, 1929.

73. Spearman, C. A footrule for measuring correlation. British Jour. Psychol.,
    Vol. II, p. 89. 1906.

74. Yule, G. Udny. An Introduction to the Theory of Statistics, Chapters III
    and IV, pp. 25-27. Sixth edition. C. Griffin and Co., Ltd., London. 1922.

75. Smith, Bradford B. Forecasting the volume and value of the cotton crop.
    Jour. Amer. Stat. Assoc., pp. 453-458. December, 1927.

    The use of interest rates in forecasting business activity. Proceedings of
    management week at Ohio State University, 1926. Published by Ohio
    State Bureau of Business Research.

76. Crum, W. L. The statistical allocation of joint costs. Jour. Amer. Stat.
    Assoc., 21, pp. 9-24. March, 1926.

77. Cowan, Donald R. G. The commercial application of forecasting methods.
    Jour. Farm Econ., pp. 139-163. January, 1930.

78. For comprehensive bibliographies of price analysis studies, see Louise O.
    Bercaw. Price analysis. U. S. Dept. Agr., Bur. Agr. Econ., Bibliography 48.
    September, 1933. Price studies of the U. S. Dept. Agr. showing
    demand-supply, supply-price, and price-production relationships. U. S.
    Dept. Agr., Bur. Agr. Econ., Bibliography 58. October, 1938. (Both
    mimeographed.) Supplementary typewritten bibliographies covering later
    studies are also available from the Bureau of Agricultural Economics
    Library. See also F. L. Thomsen. Agricultural Prices. McGraw-Hill Book
    Co., Inc., New York. 1936; and Henry Schultz. The Theory and
    Measurement of Demand. Univ. Chicago Press. 1938.





79. Hearings before the Temporary National Economic Committee, Part 26. Iron
    and steel industry. A statistical analysis of the demand for steel, 1919-38,
    pp. 13,913-13,942. Washington. 1940.

80. Roos, C. F., and Victor von Szeliski. Factors governing changes in domestic
    automobile demand. The Dynamics of Automobile Demand, General
    Motors Corporation. New York. 1939.

81. Derksen, J. B. D. Long cycles in residential building: an explanation.
    Econometrica, Vol. VIII, pp. 97-116. October, 1940.

82. Koopmans, T. Tanker Freight Rates and Tankship Building, Netherlands
    Economic Institute. London. 1939.

83. Chamberlin, Edward. The Theory of Monopolistic Competition, Harvard
    University Press, Cambridge. 1936.

84. Hearings before the Temporary National Economic Committee, Part 26. Iron
    and steel industry. Exhibit 1416, an analysis of steel prices, volumes, and
    costs - controlling limitations on price reductions, pp. 14,032-14,082.
    Washington. 1940.

85. Wylie, Kathryn H., and Mordecai Ezekiel. The cost curve for steel
    production. Jour. Pol. Econ., Vol. XLVIII, pp. 777-821. December, 1940.

86. Dean, Joel. Statistical cost curves in various industries. Report of
    Philadelphia meeting of Econometric Society, Dec. 27-29, 1939.
    Econometrica, Vol. VIII, p. 188. April, 1940.

87. Girshick, Meyer, and Ruth O’Brien. Children’s body measurements for
    sizing garments and patterns. U. S. Dept. Agr. Misc. Pub. 365. 1940.

88. Spearman, C. The factor theory and its troubles. I. Pitfalls in the use of
    probable errors. Jour. Educ. Psychol. 1932. II. Garbling the evidence.
    Jour. Educ. Psychol. October, 1933. III. Misrepresentation of the theory.
    Jour. Educ. Psychol. November, 1933. IV. Uniqueness of G. Jour.
    Educ. Psychol. February, 1934. V. Adequacy of proof. Jour. Educ.
    Psychol. April, 1934.

    Analysis of abilities into factors by the method of least squares.
    Brit. Jour. Educ. Psychol., Vol. IV. June, 1934.

89. Thurstone, L. L. The Vectors of Mind, Multiple-factor Analysis for the
    Isolation of Primary Traits. Univ. Chicago Press. 1935.

90. Bean, L. H. Ballot Behavior. American Council on Public Affairs,
    Washington. 1940.

91. Gosnell, Harold. Machine Politics, Chicago Model. Univ. Chicago Press,
    Chicago. 1937.

92. Cassels, J. M., and W. Malenbaum. Doubts about statistical supply analysis.
    Jour. Farm Econ., Vol. XX, No. 2. 1938.

    Mighell, R. L., and R. H. Allen. Supply schedules - “long-time” and
    “short-time.” Jour. Farm Econ., Vol. XXII, No. 3. 1940.

93. Allen, R. H., Erling Hole, and R. L. Mighell. Supply responses in milk
    production in Cabot-Marshfield, Vermont. U. S. Dept. Agr. Tech. Bul.
    709. 1940.

94. Jensen, Einar. Determining input-output relationships in milk production.
    U. S. Dept. Agr. Farm Management Reports, No. 5. January, 1940.

95. Ezekiel, Mordecai. A check on a multiple correlation result. Jour. Farm
    Econ., Vol. XXII, No. 2. 1940.



CHAPTER 24 


STEPS IN RESEARCH WORK AND THE PLACE OF 
STATISTICAL ANALYSIS 

Relation of statistical analysis to research. Statistical analysis 
is only a tool to be used by the investigator. The analyst must be a 
worker in some field, or in several; he cannot use his statistical train- 
ing except in analyzing problems any more than a carpenter can use 
his skill without lumber and something to be made. Now that the 
routine of statistical analysis has been discussed, and the types of 
problems to which it may be applied have been surveyed, it is perti- 
nent to ask just what are the steps in research work and just where 
and how does statistical analysis fit into the picture. 

The research worker must have an adequate knowledge of the 
facts, technical and otherwise, of the field in which he is to work. 
This knowledge is usually insured by the situation that in most cases 
the worker is a biologist, an economist, a psychologist, or an agron- 
omist, first, and then a statistician only secondarily or in addition. 
When his training has been primarily in mathematics or statistics, 
however, the statistician must acquaint himself thoroughly with the 
facts and theories of the field involved before he can expect to do 
significant and substantial work. 

Stating the objective. If adequate acquaintance with the field 
is given, the first step in a particular research problem is setting up 
the objective of the project. The objective can best be stated in the 
form of a direct question, such as "Why does lettuce sell for more 
on some days than on other days?" The more exact and specific the 
question can be made, the more clearly is the field of the investigation 
defined. Thus if we make the question read "Leaf lettuce sold at 
retail in Boston" instead of merely "lettuce," the scope of the study 
is much more definitely indicated. Stating the objective as a ques- 
tion has the important effect of clarifying the issue, and so insuring 
that the worker knows what he is really trying to find out. It has 
the further effect of instantly challenging the attention and of in- 
stinctively calling forth mental answers which aid in the next step 
of the research. 




Any research project which cannot be stated as a definite question 
has not been clearly defined. Starting out merely “to collect figures 
on lettuce marketing” would not constitute research. Clear formulation 
of the question to be answered is an essential prerequisite of 
good research work. 

Developing an hypothesis. The second step in the development 
of the problem is a deductive analysis of the question raised to sug- 
gest possible answers. This deductive analysis draws on all the 
theoretical and practical training and experience the worker has. 
In addition, he may study previous work along the same lines, ask 
questions of those concerned in the industry, or make brief recon- 
naissance studies to decide on the factors which may be involved 
and to judge of the probable relationships. This phase of the research 
should lead to the setting up of a definite hypothesis as to the 
elements which will be involved and of the ways in which they will 
be related. Thus in the lettuce problem, the hypothesis might be 
that the supply of leaf lettuce was the most important factor deter- 
mining the price and that the larger the supply, the lower the price; 
that the supply of Iceberg lettuce also influenced the price of leaf 
lettuce, large supplies of Iceberg tending to depress the price of leaf 
lettuce; that weather affected the demand, prices for the same supply 
being higher in hot weather than in cool; that prices of other vegetables, 
such as tomatoes and cucumbers, might also influence lettuce prices, 
either as competitive products tending to depress lettuce prices when 
their prices were low or as complementary products tending to raise 
lettuce prices when their prices were low. It might further be sup- 
posed that variations in the purchasing power of consumers would 
affect the demand and that changes in the general price level of food- 
stuffs would also have some effect. Finally, it might be supposed that 
the demand would vary regularly from day to day through the week, 
owing to the purchasing habits of consumers, and from time to time 
through the year. 

The process of developing the hypothesis may be aided by break- 
ing up the main question to be answered into a number of sub- 
questions, each one of which may be further broken up. Thus the 
initial lettuce question may be broken up into questions such as “Do 
(the specified prices) vary because of supply? Because of demand? 
Supplies of what? Leaf lettuce? Iceberg lettuce? Competing products? 
What are competing products? What makes demand? Weather? 
Purchasing power? Seasonal factors?” and so on until complete details 
have been thought out for every phase. 





In setting up the hypothesis the investigator should also attempt 
to think through the probable nature of the relationships. Thus, 
should it be assumed that the influence of supply of leaf lettuce on 
price will be constant, independent of other factors, or is the relation 
likely to change from time to time through the year, or from day to 
day with the weather? 

In setting up his hypotheses, the investigator not only should 
rely on his own knowledge but also should draw upon all the skill and 
knowledge of others who have experience in the same field. This will 
involve not only a careful study of earlier investigations of the same 
problem but also discussions with practical men who are operating 
in the field to be studied. Thus the student of lettuce prices should 
talk with wholesale produce merchants, retail grocery men, farmers 
producing lettuce, and even chefs and housewives, to get their opinions 
of the factors influencing lettuce prices. This will enable the student 
to check his hypotheses against the ideas of practical men dealing 
with the same problem, and often may call to his attention elements 
in the situation which otherwise he might completely overlook. 

Measuring the factors. Once the hypothesis has been set up, 
and the various factors enumerated in it have been considered with 
much care to make sure that every important element has been included, 
the next step is to secure measurements of the various factors 
to be studied. This will involve deciding whether the data are to be 
taken from published records or other secondary sources, or whether 
they are to be secured first hand. If first-hand collection is decided 
upon, further detailed study is involved as to where the ultimate 
facts are, who has knowledge of them or records of them, and how 
they are to be collected — by measurement, by direct observation, by 
enumerators, by schedules, by mail questionnaires, etc. Extended dis- 
cussions of the advantages and disadvantages of each method, and 
the problems involved in laying out a record form, defining the units, 
securing the records, and checking or editing the reports are available 
in standard statistical textbooks¹ and will not be repeated here. 

1 Arthur L. Bowley, Elements of Statistics, Chapters III and VIII, pp. 18-57, 
178-195, fourth edition, P. S. King & Son, Ltd., London, C. Scribner’s Sons, New 
York, 1920. 

Horace Secrist, Introduction to Statistical Methods, pp. 22-52, 65-71, The 
Macmillan Co., New York, 1917. 

Harry Jerome, Statistical Method, pp. 13-23, Harper and Brothers, New York, 
1924. 

William L. Crum and Alson C. Patton, An Introduction to the Methods of 
Economic Statistics, Chapters II, III, IV, pp. [illegible], A. W. Shaw Co., 
Chicago and New York, 1925. 

Frederick E. Croxton and Dudley J. Cowden, Applied General Statistics, 
Chapter II, pp. 15-48, Prentice-Hall, Inc., New York, 1939. 

Precautions also need to be observed if secondary sources are used; 
these precautions also are well discussed in the references just given. 
Only one point will be developed here, and that is the special need of 
completeness in the records, particularly if original data are to be 
secured. Once an enumeration or observation has been made, additional 
data can be secured only at much extra trouble and expense, 
or in many cases cannot be secured at all. That is one reason why 
the hypothesis must be carefully studied beforehand to make sure all 
relevant factors are included, and why the preliminary study and 
investigation are so important. Factors which are stumbled upon or 
which suggest themselves in the later analysis may be of value in 
subsequent studies of the same type, but if the essential data are lacking 
the suggestions are too late to be of any value in the current study. 

In obtaining the basic data it is necessary to decide on the particular 
items to be measured to represent the hypothetical factors. 
Are the weather elements to be rainfall, or wind, or temperature? 
If temperature, average, or maximum, or minimum? If average, what 
kind of average? And so on through a lengthy number of details, 
each one of which must be carefully considered in view of the hypothetical 
significance of the factor, the probable relations involved, and 
the effect which is expected to be shown. 

Studying the apparent relations. After numerical values are 
available for all the elements, the next step is to make a thorough 
study of the apparent relationships before proceeding to more elaborate 
analyses. Both the relation of the independent factors to each 
other and the relation to the dependent must be studied, for, as has 
been pointed out before, the relation of the dependent factor to an 
independent factor that is not related to the others can be determined 
by simple correlation, whereas otherwise multiple correlation might 
be necessary. (This does not hold, however, if joint functions are 
present.) It is at this point that the investigator begins to test out 
the various elements in the hypothetical picture and to compare the 
hypothesis with the observed facts. Some elements which were 
thought to be of importance may prove unrelated, and other variables 
which were thought of doubtful significance may show important 
relations. This preliminary examination may even prove the entire 
hypothesis to be wrong and necessitate a re-examination of the basic 
ideas and a reformulation of the proposed explanation more in line 
with the facts as observed. 

Running a Correlation Analysis 

The preliminary examination of the data will provide the basis 
for setting up the final multiple correlation analysis, if the inter- 
relations are such that such an analysis is finally found to be needed. 
As included in this analysis, each variable will have a definite place in 
the hypothesis, and some specific kind of relation will be expected to 
be found when the analysis is completed. Looked at in this way, the 
correlation analysis is not the whole of the research project, but is 
merely that portion of it in which the adequacy of the theoretical 
hypothesis is tested and in which the exact relations, as expected in 
the hypothesis, are measured and determined. 

Units in which variables are stated. Once the variables to be 
employed in the final statistical analysis are selected, the next prob- 
lem is to decide in what units to state them. In studying land values, 
for example, the value of a given farm may be stated as total value, 
as value per acre of all land, or as value per acre of improved land. 
Which one to select depends on what other variables are included 
and how they are to be stated. The total value of the farm might 
be correlated with the value of the dwelling, the value of other build- 
ings, the acres in cultivated land, the acres in pasture, etc. This 
would tend to show the contribution per acre of each of the acreage 
elements and should give a high correlation, since under normal con- 
ditions the value of the farm may be expected to approximate the 
value of the buildings plus that of the several tracts of land. In 
this case the simple or additive regression equation would be quite 
appropriate, for it would give 

Farm value = value of dwelling + value of other buildings 
    + (value per acre of cultivated land) (number acres of cultivated land) 
    + (value per acre of pasture land) (number acres pasture land) 
    + (value per acre of woodland) (number acres woodland) + etc. 

But if it were desired to measure the influence of type of road, fer- 
tility of land, and distance from town on land value, they could not be 
so readily included in the same additive equation. For example, a 
40-acre farm yielding 40 bushels of corn to the acre might be worth 
on the average $1,000 more than a farm of the same size yielding 30 
bushels of corn per acre. Under the same conditions, it would not be 
reasonable to expect that a 160-acre farm yielding 40 bushels of corn 
to the acre would be worth only $1,000 more than a 160-acre farm 
yielding 30 bushels per acre. In the first case, the higher yield would 
add $25 per acre to the farm value, in the latter, only $6.25. Yet if 
yield of corn were added as a factor to the above equation, that would 
assume that a given increase in fertility would add the same amount to 
the value of the farm, no matter how large or how small the farm was. 
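
The arithmetic behind the $25 and $6.25 figures can be checked with a short Python sketch. The function name is hypothetical and serves only this illustration; the point is that a fixed total premium, which is what an additive yield term would imply, shrinks per acre as the farm grows.

```python
def premium_per_acre(total_premium, acres):
    """Per-acre value added when a fixed total premium is spread over a farm."""
    return total_premium / acres

# The same $1,000 premium for the higher corn yield, on two farm sizes.
small_farm = premium_per_acre(1000, 40)    # 40-acre farm
large_farm = premium_per_acre(1000, 160)   # 160-acre farm

print(small_farm, large_farm)
```
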

If the value were stated as value per acre, that would partly solve 
the difficulty, for a given change in fertility, distance from town, or 
type of road would then be assumed to have the same influence upon 
value per acre no matter how large or how small the farm was. But 
that would introduce difficulty with other variables. The dwelling, for 
example, would not become larger in direct proportion to the size of 
the farm. Very large farms with good dwellings would have a very 
low “value-of-dwellings-per-acre,” and small farms with poor dwellings 
would also have a low “value-of-dwellings-per-acre.” Only some 
method of determining the effect of value of dwellings on land values 
separately for farms of different sizes would take care of this difficulty, 
as otherwise the same measurements would be used for dwelling values 
which might be different in their effect on land values, with consequent 
confusion of the results.² 

Type of equation to be fitted. The case mentioned also illus- 
trates the need of something other than a simple additive regression 
equation to express certain cases. If it is assumed that the more fer- 
tile the farm, the greater the effect of nearness to town would be, and 
that the nearer to town, the greater the effect of an increase in fertility 
would be, that could not be adequately expressed by the regression 
equation 

Value per acre = f(distance) + f(fertility) + etc. 

The multiplying effects of the two variables upon the value could be 
allowed for by using the equation 

Value per acre = [f1(distance)] [f2(fertility)] [f(etc.)] 

which, for the actual process of computation, can be stated 

logarithm (value per acre) 
    = φ1(log distance) + φ2(log fertility) + φ(log etc.) 

2 See the appendix, pages 30-54, of U. S. Dept. Agr. Bul. 1400, Factors affecting 
farmers’ earnings in southwest Pennsylvania, for an example of statistical treatment 
of a problem of this type. 





This logarithmic equation, which puts the relations on a relative or 
proportional rather than an absolute or arithmetic base, is a very flex- 
ible one and one that can be used in a great many types of problems. 
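
A minimal Python sketch of the logarithmic form follows. It assumes the simplest case, in which each function of a logarithm is a straight line, so that the exponents are constant. The farm data, the exponents -0.3 and 0.8, and the multiplier 20 are invented for illustration and are not taken from any study. Because the values are generated from an exactly multiplicative relation, a linear least-squares fit on the logarithms should recover the assumed exponents.

```python
import math

# Hypothetical farms: (miles from town, corn yield in bushels per acre).
farms = [(1, 30), (2, 40), (4, 30), (8, 50), (3, 45), (6, 35)]
# Assumed multiplicative relation, chosen only for illustration.
values = [20 * d ** -0.3 * f ** 0.8 for d, f in farms]

def deviations(series):
    """Return the series as deviations from its mean, and the mean."""
    m = sum(series) / len(series)
    return [v - m for v in series], m

log_d, mean_d = deviations([math.log10(d) for d, _ in farms])
log_f, mean_f = deviations([math.log10(f) for _, f in farms])
log_v, mean_v = deviations([math.log10(v) for v in values])

# Sums of squares and products of the logarithms (as deviations).
s_dd = sum(a * a for a in log_d)
s_ff = sum(a * a for a in log_f)
s_df = sum(a * b for a, b in zip(log_d, log_f))
s_dv = sum(a * b for a, b in zip(log_d, log_v))
s_fv = sum(a * b for a, b in zip(log_f, log_v))

# Solve the two normal equations for the net regression coefficients.
det = s_dd * s_ff - s_df ** 2
b_d = (s_dv * s_ff - s_df * s_fv) / det
b_f = (s_dd * s_fv - s_df * s_dv) / det
constant = mean_v - b_d * mean_d - b_f * mean_f

print(round(b_d, 4), round(b_f, 4), round(10 ** constant, 2))
```

Read back in natural units, each coefficient is a proportional effect: a 1 per cent change in the factor goes with about that many per cent change in value per acre, whatever the level of the other factor.
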

Finally, if the effect of fertility upon land value be found to vary 
with fertility, say, and the effect of building value with size of farm, 
not even the logarithmic equation would be applicable. Instead, an 
equation of the joint-function type (note Chapter 21) might be used, 
such as 

Log (value per acre) = /(distance, fertility, roads) 

-f /(value dwelling, size of farm, value barns) 
+ etc. 

One further consideration is the danger of false results or spurious 
correlation if the variables are improperly stated. Thus if an attempt 
were made to correlate the value of farms with three factors, (A) the 
percentage of land in corn, (B) the percentage of land in wheat, and 
(C) the percentage of land in all other uses, it would be impossible to 
solve the problem, or else it would give a spurious result. That is because 
the factors (A), (B), and (C) would add to exactly 100 per cent 
in each individual case, and after variation in (A) and (B) had been 
held constant by statistical means, there would not be any room left 
for variation in (C). Even if as the result of rounding off the variables 
there were slight deviations from the 100 per cent total, the results 
would have little significance, as the practically perfect intercorrelation 
between the three independent factors would make the measures of 
their net influence, both regression coefficients and net coefficients of 
correlation, exceedingly subject to error.³ Only by dropping out one 
of the factors, say (C), would significant results be secured. The regressions 
on (A) and (B) would then also show the effect of (C); for 
example, the increase in value for each unit increase in (A) would 
mean the increase due to substituting one unit of (A) for one unit of 
(C); changing the sign would give the effect of substituting one unit 
of (C) for one of (A). The same principle would then apply as between 
(B) and (C); whereas the increase in the dependent variable 
for substituting one unit of (B) for one of (A) would be the difference 
between the two net regression coefficients. 

3 For an extended mathematical treatment of this problem, see Ragnar Frisch, 
Statistical confluence analysis by means of complete regression systems, Oslo 
University Økonomiske Institutt, Publikasjon No. 5 (1934). 
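
The percentage example can be sketched in Python with exact fractions. The farms and the per-point values (6 for corn, 2 for wheat, 1 for other uses) are assumptions made up for the illustration. With all three percentages included, the normal equations have no unique solution. With (C) dropped, each remaining coefficient measures the effect of substituting that use for (C).

```python
from fractions import Fraction

def solve(a, b):
    """Solve a x = b exactly; return None if the system is singular."""
    n = len(b)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return None  # no unique solution: the variables are redundant
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def normal_equations(rows, y):
    """Build the sums of squares and products for a least-squares fit."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return xtx, xty

# Hypothetical farms: per cent of land in corn (A) and in wheat (B).
shares = [(20, 30), (40, 10), (10, 50), (30, 30), (50, 20), (25, 15)]
# Assumed value per percentage point: A = 6, B = 2, C = 1, plus a constant.
y = [50 + 6 * a + 2 * b + 1 * (100 - a - b) for a, b in shares]

# All three percentages (plus the constant term): they sum to 100.
full = [[1, a, b, 100 - a - b] for a, b in shares]
# Factor (C) dropped.
reduced = [[1, a, b] for a, b in shares]

full_fit = solve(*normal_equations(full, y))
reduced_fit = solve(*normal_equations(reduced, y))

print(full_fit)                      # None: the problem cannot be solved
print([int(c) for c in reduced_fit])
```

In the reduced fit the coefficient on (A) is 6 - 1 = 5, the gain from shifting one point of land from (C) to (A); the coefficient on (B) is 2 - 1 = 1; and their difference, 4, is the gain from substituting (A) for (B).
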





After the variables to be examined and the nature of the regression 
function to be used have been decided upon, at least tentatively, it is 
necessary to decide what type of curves are to be fitted. If mathe- 
matical regressions are to be used, this involves deciding what form of 
equation is to be used. (Note pages 76 to 125 of Chapter 6, and 397 
to 401 of Chapter 22.) If curves are to be fitted by one of the graphic 
methods, limiting conditions to be applied in fitting the curves must 
be worked out, in the light of the hypotheses stated and of the tech- 
nological and other knowledge of the relations. (See Chapter 6, pages 
109 to 110, Chapter 14, page 224, and Chapter 16, pages 278 and 279.) 

Steps in carrying through the computations. After the variables 
and the form of the equation for the statistical analysis have been de- 
cided upon, the next step is actually carrying through the computation. 
This involves “coding” the numerical values of the variables, that is, 
reducing them to simpler terms for ease of handling; calculating the 
extensions; setting up and solving the normal equations; and calcu- 
lating the standard error of estimate, the coefficient of multiple corre- 
lation, and possibly the coefficients of separate determination or of 
part correlation. Then if curvilinear regressions are desired, the re- 
siduals from the linear regression equation will be computed, and the 
net regression curves determined by successive approximation (or by 
the graphic short-cut method if the conditions are favorable). After 
the final curves are determined, the standard error of estimate for the 
curvilinear regressions and the index of multiple correlation are com- 
puted. If joint functions are suspected, the residuals are grouped with 
respect to two or more variables, or studied with respect to compound 
variables of the Court type, until by successive approximations the 
final shape of all the functions, simple or joint, has been determined, 
and the new standard error of estimate and index of correlation com- 
puted. As a final step, the standard error of each of the regression 
coefficients, or of each portion of each regression curve, should be computed 
and indicated on the regression charts, to indicate the significance 
to be attached to the results. The standard error of the correlation 
coefficient or other constants likewise should be determined. All 
through the process, the statistical relations found should be checked 
back against the hypothetical expectations. If the statistical results 
conflict with the hypothesis, both should be re-examined to see where 
the conflict lies, as discussed in more detail subsequently. 
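
The linear part of this routine can be sketched in Python for two independent variables. The observations are invented for illustration, and the coding shown here (deviations from the means) is only one convenient choice. The standard error of estimate and the coefficient of multiple correlation are computed in their simple unadjusted forms, without any correction for the number of constants fitted; that is an assumption of this sketch.

```python
import math

# Hypothetical observations: two independent variables and one dependent.
x1 = [2, 4, 6, 8, 10, 12]
x2 = [1, 3, 2, 5, 4, 6]
y = [7, 12, 14, 21, 22, 28]
n = len(y)

def mean(series):
    return sum(series) / len(series)

# Step 1: code each variable as a deviation from its mean.
d1 = [v - mean(x1) for v in x1]
d2 = [v - mean(x2) for v in x2]
dy = [v - mean(y) for v in y]

# Step 2: the extensions (sums of squares and products).
s11 = sum(a * a for a in d1)
s22 = sum(a * a for a in d2)
s12 = sum(a * b for a, b in zip(d1, d2))
s1y = sum(a * b for a, b in zip(d1, dy))
s2y = sum(a * b for a, b in zip(d2, dy))
syy = sum(a * a for a in dy)

# Step 3: solve the two normal equations
#   s11*b1 + s12*b2 = s1y
#   s12*b1 + s22*b2 = s2y
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
a = mean(y) - b1 * mean(x1) - b2 * mean(x2)

# Step 4: residuals, standard error of estimate, multiple correlation.
resid = [yi - (a + b1 * u + b2 * v) for yi, u, v in zip(y, x1, x2)]
s_est = math.sqrt(sum(z * z for z in resid) / n)
sigma_y = math.sqrt(syy / n)
r_mult = math.sqrt(1 - s_est ** 2 / sigma_y ** 2)

print(round(b1, 3), round(b2, 3), round(s_est, 3), round(r_mult, 3))
```

The residuals computed in the last step are also the starting point for any curvilinear work, since they are what would be plotted against each independent variable in turn.
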





Meaning of Correlation Results 

It must be noted, however, that a statistical determination of the 
nature of any relation, no matter how complicated the methods used 
in making the determination or how flexible the type of function al- 
lowed for, tells nothing of the reason for the relation observed. 

Thus the variation in potato yields with differences in early and 
late rainfall, as determined in Chapter 22, may be due to a large va- 
riety of different causes. The plant requires certain conditions of soil 
moisture, nutrients, sunshine, maximum and minimum temperature, 
and relative humidity to make the best growth, and the factors used 
reflect certain of them. Further, it may be that one set of conditions 
is required during the first part of the growing period while the plant 
is developing its leaves and top, and another set later on while it is 
developing the tubers; and that the rainfall factors used relate in this 
way to the growth periods of the plant. 

There are other possibilities, however. The yield of a plant is af- 
fected by the weather conditions not only as they directly affect the 
development of the plant itself but also as they affect the development 
of insects and diseases that prey on the plant. For example, the pe- 
culiar relation of potato yields to early and late rainfall considered 
jointly, as shown in Figure 74, might reflect the relation of late rain- 
fall to potato diseases. With 16 inches of early rainfall and 3 inches 
of late, a yield of 240 bushels would be expected; with the same early 
rainfall, as the late rainfall increases, the probable yield declines until 
with 6 inches of late rainfall it is under 180. The heavy early rainfall 
may stimulate good growth of the top; then if heavy late rainfall 
should follow it might result in conditions favorable to the develop- 
ment of potato blight, and so reduce an otherwise promising yield. 

It is evident that a considerable range of specific technical infor- 
mation is necessary to interpret correctly the results of a correlation 
analysis, and to develop the reasons for the particular relations which 
have been found to exist. For best results this technical knowledge 
must be drawn upon in the early stages of the investigation, to aid in 
selecting and stating the variables to be considered in such a way that 
the functional relations, when found by appropriate statistical means, 
would adequately represent the technological elements present and so 
be capable of a logical technical interpretation. The correlation anal- 
ysis itself can never provide the interpretation of cause and effect. It 
can only establish the facts of the relations — for the meaning of those 
facts, the investigator must look elsewhere. 





The way in which correlation analysis establishes the facts of rela- 
tionship and nothing else may be illustrated by a specific example. If 
the number of automobiles moving down Sixteenth Street in Washington, 
D. C., for each 15-minute period through a given 12 hours is correlated 
with the height of the water in the Potomac River during each 
of the same periods, a definite correlation will be obtained. On some 
days this correlation would be so high that its probable error would 
indicate that it would be very unlikely that it could have occurred by 
chance. However, if on the basis of this correlation one were to at- 
tempt to forecast the flow of traffic from the height of the water, he 
would find his forecast sadly in error if he made it for another day 
when the street was closed for traffic repairs, when the water was high 
because of a flood, or when the moon was in a different phase. This 
is a case in which it is perfectly obvious that there is no direct causal 
relation between the two phenomena. Yet there is real correlation 
between them because they both are influenced, though very remotely, 
by the same sequence of cosmic events. The rising and the setting of 
the sun have a very definite influence on the movements of persons and 
therefore on the flow of traffic, whereas the rising and the setting of 
the moon likewise have a definite influence on the height of the water. 
Washington is so close to the ocean, and has so low an elevation, that 
the Potomac River has a definite ebb and flood of tide. There is a 
certain specific though complex relation between the rising and setting 
of the sun and of the moon. This relation is changing constantly from 
day to day. This illustrates a case in which real and significant cor- 
relation between two variables reflects causation by a common factor 
or factors, yet gives no inference as to direct causal connections. Many 
similar cases are met with in practical work in which the correlation 
between two variables is due to both being influenced by certain com- 
mon causes although neither may in any conceivable way influence the 
other. This illustrates again the need for clear, logical thinking and 
for a technological basis for the interpretation of the statistical results, 
which can measure the relationships but of themselves can tell nothing 
of cause or effect. 

Statement of results of correlation analysis. Having completed
the statistical analysis of the relations (the extent and complexity of
which will depend upon the nature of the problem, the number of
observations available, the importance of the relations, and the
facilities available with which to work), the next step is to translate
the statistical results to intelligible non-technical statement. This may
go only so far as simple regression charts or estimating tables of the type
shown at the end of Chapter 13, or of carefully worked-out pictorial
statements such as shown in Fig. 75. After the results are reduced to 
intelligible form — intelligible, that is, at least to the investigator — 
they should be carefully compared with the original hypothesis. If 
hypothesis and the statistical results do not agree, the hypothesis must 
be carefully examined to see if it may logically be restated so as to be 
consistent with the facts as found; and the analysis must be carefully 
studied to see if there are any loopholes in the way the facts are stated, 
or in the way the problem has been worked through, which may be 
responsible for the results. (The preliminary results cited at the top 
of page 419, in Chapter 23, are an example of mis-statement of the 
variables.) If the hypothesis and results are found to be consistent, 
or if, without doing violence to either, they can be brought into reason- 
able agreement, the project may be regarded as completed. If such 
agreement is not obtained, the results may be announced as actual 
observations inconsistent with what was expected and subject to fur- 
ther study or independent checks before being accepted as scientific 
conclusions. 

Finally, if forecasts of future events or estimates for new observa- 
tions are to be made from the results of the analysis, the methods 
outlined in Chapter 19 should be used to help judge how much con- 
fidence can be placed in such estimates or forecasts. 

When the hypothesis and the analysis are found to be in satisfac- 
tory agreement, all that remains is to interpret the results to those 
who will be interested in them and have to use them. At this point 
many investigators fail to take into account the audience for which 
they are writing. If they are writing a technical paper for a scientific 
journal, a full discussion of the methods and techniques used, statis- 
tical and otherwise, will be quite in place, so that their fellows may 
pass on the adequacy of the work. If, instead, they are writing a gen- 
eral or a popular report for an audience which is only interested in 
what they have discovered and what it means, details of statistical 
technique may be as out of place as computations of stresses and 
strains would be in a magazine devoted to “The Home Beautiful.”
The plethora of technical terminology in some supposedly popular 
reports of statistical investigations has led the readers to suspect that 
the investigator himself did not understand what his results really 
meant. Unless the conclusions can be translated back into “the King’s
English,” and stated so simply that practical men dealing with the
problem investigated can understand what the results mean, the use- 
fulness of the research may be largely wasted. 





Summary. The place of statistical analysis in scientific research
is no different from the place of any other technical aid the investigator 
may employ. It furnishes a means of measuring the elements that are 
involved and of examining the way in which they are related; but it 
does not of itself furnish an explanation of phenomena. Except insofar 
as the effort to reduce the variables to specific numerical statement, 
definitely related, forces the investigator to think more clearly and 
definitely about his problem, statistical analysis is not a substitute 
for logical analysis, clear-cut thinking, and full knowledge of a prob- 
lem. The methods of analyzing complicated relations set forth in this 
book furnish the student keen tools for investigating complex rela- 
tionships; but, like all keen tools, they may yield unsatisfactory or 
misleading results if employed carelessly or heedlessly. Statistical 
analysis is not a substitute for careful thinking and skilled workman- 
ship in research work; instead, it is an aid which may make that 
thought and skill even more productive of worth-while results. 




APPENDIX 1 

METHODS OF COMPUTATION 

Coefficients of correlation and regression. Many of the operations described in
the text may be performed much more rapidly by short-cut methods. One such
method, for computing $\sigma_x$, has already been given. (Note 1, Chapter 1.) When the
value $\sum x^2$, required in the normal equations, is desired instead, that may be
calculated from a frequency table by the same method, by use of the relation

$$\sum x^2 = n\sigma_x^2 \qquad (100)$$

A similar short cut may be used in computing the product sums, $\sum xy$, required
to determine coefficients of regression or correlation. The first step is to construct
a double-frequency table. Such a table is known as a correlation table, since it
shows the nature of the relation between the two variables in much the same way
that a correlation chart or dot chart does. The following correlation table, Table 86,
is prepared from the haystack data used as an illustration in Chapter 21.

First the number of items falling in each subgroup is determined. Then the
entries in each row are summed, giving the total frequencies with respect to $X_1$.
These frequencies, denoted $F_1$, are shown at the right in the table, in the column
headed “All values of $X_2$.” The entries in each column are similarly added, and
the totals entered at the foot. These entries give the frequency distribution with
respect to $X_2$, and are, therefore, denoted $F_2$ (frequencies of $X_2$). A central group is
then selected for each variable, and the departures of the other groups, above or
below that central group, are shown in the “$d_1$” and “$d_2$” column and line,
respectively. The usual extensions to compute the $\sum x^2$ for each variable are shown in
the final columns, $d_1F_1$ and $d_1^2F_1$, and lines $d_2F_2$ and $d_2^2F_2$. As their designations
indicate, the entries under these heads are obtained by first multiplying the $F_1$ entry
by $d_1$, giving $d_1F_1$, and then multiplying that again by $d_1$ to give $d_1^2F_1$. The similar
computations for $F_2$ are shown at the foot of the table.

The new step in the table is the incorporation of the column $\sum d_2F$ and of the line
$\sum d_1F$. The entries in the column $\sum d_2F$ show for each line the sums of the frequencies
of each cell in that line multiplied by the $d_2$ values for each cell. Thus for the first
line, the single entry has a $d_2$ value of -3, so the entry in $\sum d_2F$ is -3. The next
line similarly has a single entry in the -2 column. The fourth line, though, has the
following frequencies: 1 in the -4 column, 1 in -3, 2 in -2, 8 in -1, 4 in 0, and 1
in 1. The respective products, -4, -3, -4, -8, 0, and 1, add to -18, and this
value is, therefore, entered in the $\sum d_2F$ column.

The $\sum d_1F$ line is similarly computed, showing the sum of the frequencies in each
cell for each column, multiplied by the corresponding $d_1$ values. Thus the first
column has 1 entry in the -1 line; the second, 1 in the -4, and 1 in the -1, with
the sum -5. The third column has 1 in -3, 1 in -2, 2 in -1, 6 in 0, and 3 in 1.
The products -3, -2, -2, 0, and 3 add to -4; and this is the value for the $\sum d_1F$
entry for that column.

After all the $\sum d_2F$ entries are made, each is multiplied by the $d_1$ value for the
same line, giving the values entered in the $d_1(\sum d_2F)$ column. Similarly, the entries
in the $\sum d_1F$ line are then multiplied by the corresponding $d_2$ values, and the products
entered in the $d_2(\sum d_1F)$ line. Each line at the foot of the table is then summed, and
the sums entered at the right of the line; and each column at the right of the table
summed, and the sum entered at the foot of the column.

The arithmetic may now be checked by the following identities:

$\sum F_1 = \sum F_2$   (= 120, in the illustration)

$\sum d_2F = \sum d_2F_2$   (= 24)

$\sum d_1F = \sum d_1F_1$   (= 120)

$\sum d_1(\sum d_2F) = \sum d_2(\sum d_1F)$   (= 244)

As a matter of fact, the lines $\sum d_1F$ and $d_2(\sum d_1F)$ may be omitted, if this check is
not to be made. Where many items are involved, however, this is a rapid and
accurate check. Only the extensions $d_1^2F_1$ and $d_2^2F_2$ are then not checked, and these
may be verified readily by recomputing.

The next step is to adjust the values in terms of departures from the assumed
means to terms of departures from the true means. To do this in organized fashion,
the five sums are entered in order, as shown at the foot of the tabulation. The first
two items, $\sum d_1F$ and $\sum d_2F$, are each divided by the number of cases (120) to give the
average departure (in terms of class intervals) from the assumed mean group, that is,
values for $M_{d_1}$ and $M_{d_2}$.

The correction factors are then computed as follows:

$$\sum x_1^2 = \sum d_1^2F_1 - (\sum d_1F)(M_{d_1}) = 434 - (120)(1.0) = 314$$
$$\sum x_2^2 = \sum d_2^2F_2 - (\sum d_2F)(M_{d_2}) = 300 - (24)(0.20) = 295.2$$
$$\sum x_1x_2 = \sum d_1d_2F - (\sum d_1F)(M_{d_2}) = 244 - (120)(0.20) = 220$$

This process is shown in tabular form for each item.

The coefficients of regression and of correlation are then computed by the usual
formulas, as shown at the foot of the table.
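
The correction to true means and the resulting correlation can be traced in a few lines. The following is only a sketch: it takes the five sums quoted in the text as given, and the variable names are illustrative choices, not the author's notation.

```python
import math

# Sums taken from the foot of the correlation table, in class-interval units.
n = 120             # number of observations
sum_d1_f = 120      # sum of d1 * F1
sum_d2_f = 24       # sum of d2 * F2
sum_d1sq_f = 434    # sum of d1^2 * F1
sum_d2sq_f = 300    # sum of d2^2 * F2
sum_d1d2_f = 244    # sum of d1 * d2 * F over all cells

# Mean departures from the assumed-mean groups (1.0 and 0.20 in the text).
mean_d1 = sum_d1_f / n
mean_d2 = sum_d2_f / n

# Correct the sums from departures about assumed means to departures
# about the true means.
sum_x1_sq = sum_d1sq_f - sum_d1_f * mean_d1
sum_x2_sq = sum_d2sq_f - sum_d2_f * mean_d2
sum_x1x2 = sum_d1d2_f - sum_d1_f * mean_d2

# Uncorrected coefficient of correlation in group units.
r12 = sum_x1x2 / math.sqrt(sum_x1_sq * sum_x2_sq)
print(sum_x1_sq, sum_x2_sq, sum_x1x2, r12)
```

Because the correlation coefficient is unaffected by the size of the class intervals, no conversion to original units is needed for this step.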

The coefficient of regression shown in the tabulation is of course in terms of
group units. That is, the value $b_{12} = 0.7$ means that for every change of one group
interval in $X_2$ there is on the average a change of 0.7 group intervals in $X_1$. Since
the group intervals are 0.020 for $X_2$, and 0.080 for $X_1$, this does not apply to the
actual $X_1$ and $X_2$ values. Instead, correction must be made as follows:

Regression of $X_1$ on $X_2$ in terms of original units
   = (regression in terms of group units) × (group interval of $X_1$ / group interval of $X_2$)

In this case,

$$b_{12}\ (\text{for } X_1 \text{ and } X_2) = 0.7006 \left(\frac{0.080}{0.020}\right) = 2.8024$$

Hence for each change of 1 unit in $X_2$, $X_1$ changes 2.8 units, on the average.





Before $a_{12}$ can be calculated, it is necessary to have the means of $X_1$ and $X_2$.
The assumed mean of $X_1$, at the midpoint of the group 0.340-0.419, would be at
0.380. Since the mean of $d_1$ is 1.00, the true mean lies exactly one group interval, or
0.080, higher than this, or at 0.460. Similarly, the assumed mean of $X_2$, at the
midpoint of the group 0.125-0.144, is at 0.135. Adding to this the mean of $d_2$, or 0.20,
times the group interval of 0.020, gives 0.139 as the mean of $X_2$. The value of $a_{12}$
may now be calculated by formula (10):

$$a = M_1 - b_{12}M_2 = 0.139 - (2.8024)(0.460) = -1.150$$


From the values shown in the table, the regression equation is, therefore, found
to be

$$X_1 = -1.150 + 2.8024X_2$$


The uncorrected correlation coefficient, $r_{12}$, has been found to be 0.722, as shown
in the table. In making this computation, however, no allowance has been made
for the tendency of grouping to exaggerate the departures of the individual cases
from the means, which affects $\sum x_1^2$ and $\sum x_2^2$, but does not affect $\sum x_1x_2$. This may
be allowed for by applying Sheppard’s correction to $\sum x_1^2$ and $\sum x_2^2$. (See Note 1,
Chapter 1.) Since the correction, in class-interval units, is
(corrected $\sigma^2$) = $\sigma^2 - \frac{1}{12}$, the correction for $\sum x_1^2$ is $\sum x_1^2 - \frac{n}{12}$.
If we apply this correction to both $\sum x_1^2$ and $\sum x_2^2$, in the formula
for $r_{12}$, the value of $r_{12}$ comes out 0.747, a definitely higher value. In this case only
9 groups have been used for $X_2$ and only 10 for $X_1$, so the correction for grouping is
important. In most practical work, at least 20 to 30 groups should be used for each
variable; and when that is done, the application of the correction for the fineness of
grouping becomes of much less significance.
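
The effect of Sheppard's correction on the correlation can be checked numerically. This sketch assumes the corrected product sums quoted in the text (314, 295.2, and 220, with 120 observations) and applies the correction of n/12 in class-interval units; the variable names are illustrative.

```python
import math

n = 120
sum_x1_sq = 314.0   # sum of squared departures of the first variable, group units
sum_x2_sq = 295.2   # sum of squared departures of the second variable, group units
sum_x1x2 = 220.0    # product sum, which grouping does not affect

# Sheppard's correction removes 1/12 of a squared class interval per
# observation from each sum of squares.
grouping_allowance = n / 12

r_uncorrected = sum_x1x2 / math.sqrt(sum_x1_sq * sum_x2_sq)
r_corrected = sum_x1x2 / math.sqrt(
    (sum_x1_sq - grouping_allowance) * (sum_x2_sq - grouping_allowance)
)
print(r_uncorrected, r_corrected)
```

The corrected value is necessarily the larger of the two, since only the denominators shrink.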

Although the correlation computed with Sheppard’s corrections (and adjusted 
for the number of observations by equation [25]) gives the best estimate of the true 
correlation in the universe from which the sample was drawn, the formulas for the 
standard error of correlation coefficients are all based on the uncorrected formula. 
If any test of the significance of the observed correlation, such as Fisher’s z-trans- 
formation, is to be applied, the unadjusted value should be used. 

The regression coefficient, as well as the coefficient of correlation, is changed
slightly if Sheppard’s correction is applied. Thus, using the correction,

$$b_{12} = \frac{\sum x_1x_2}{\sum x_2^2 - \frac{n}{12}} = \frac{220}{314 - \frac{120}{12}} = 0.724$$


Just as with the correlation coefficient, the larger the number of groups, the less 
influence the correction has on the calculated values. With 30 or more groups, it 
is ordinarily neglected. 

There are many other ways in which the coefficient of correlation may be
calculated. Thus it may be shown (Note 1, Appendix 2) that if

$$X_3 = X_1 - X_2$$

the standard deviations are related according to the formula

$$\sigma_3^2 = \sigma_1^2 - 2r_{12}\sigma_1\sigma_2 + \sigma_2^2$$

Hence the correlation between $X_1$ and $X_2$ may be found by calculating the difference,
$X_3$, between each pair of values for the two variables, and computing the standard
deviation of each of the three series. The correlation is then given directly by the
formula

$$r_{12} = \frac{\sigma_1^2 + \sigma_2^2 - \sigma_3^2}{2\sigma_1\sigma_2}$$
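
The difference method can be compared with the ordinary product-moment calculation. The sketch below borrows the X4 and X1 columns of Table 87 purely as convenient sample data; the function names are illustrative.

```python
import math

def std_dev(values):
    """Standard deviation about the mean, with n as the divisor."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def r_from_differences(a, b):
    """Correlation from the standard deviations of a, b, and a - b."""
    diff = [u - v for u, v in zip(a, b)]
    s_a, s_b, s_d = std_dev(a), std_dev(b), std_dev(diff)
    return (s_a ** 2 + s_b ** 2 - s_d ** 2) / (2 * s_a * s_b)

def r_direct(a, b):
    """Correlation from the product sum of departures from the means."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    sxy = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    sxx = sum((u - mean_a) ** 2 for u in a)
    syy = sum((v - mean_b) ** 2 for v in b)
    return sxy / math.sqrt(sxx * syy)

# Sample data: columns X4 and X1 of Table 87.
x4 = [106, 103, 108, 102, 111, 91, 109, 118, 123, 108, 100, 88, 109]
x1 = [103, 108, 102, 111, 95, 109, 118, 123, 108, 100, 88, 109, 103]

print(r_from_differences(x4, x1), r_direct(x4, x1))
```

The two results agree to within rounding, since the formula is an algebraic identity and not an approximation.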

Coefficients of multiple correlation and net regression. When many variables
are to be considered, and a large number of observations are available, the necessary
extensions for multiple correlation lines or curves fitted by the least-squares method
may be made most readily by the use of tabulation cards, one for each observation,
with the values for each variable entered on each card. If mechanical or electrical
tabulation equipment is available, the values may be designated by punched holes.
The cards can then be sorted and tabulated by automatic machines.

TABLE 86

Computation of Extensions by “Digiting” and Accumulative Totaling

Column (1) is the number of items; (3), (5), (7), and (9) are the group sums
Σ(X1), Σ(X2), Σ(X3), and Σ(X4); (2), (4), (6), (8), and (10) are the
accumulations giving X1, (X1)², X1X2, X1X3, and X1X4, respectively.

Value    Items  Accum.  Σ(X1)   Accum.  Σ(X2)  Accum.  Σ(X3)  Accum.  Σ(X4)  Accum.
of X1     (1)    (2)     (3)     (4)     (5)    (6)     (7)    (8)     (9)    (10)

30-39       2     20      69      690     11     110      1     10       5     50
20-29       0     20       0      690      0     110      0     10       0     50
10-19      20    220     295    3,640     32     430     18    190      26    310
0-9        10   (320)     66   (4,300)    14    (570)    16   (350)     23   (540)

-9          3      3      47       47      2       2      5      5       2      2
-8          2      5      28       75      3       5      7     12       3      5
-7          0      5       0       75      0       5      0     12       0      5
-6          1      6      21       96     12      17      2     14       3      8
-5         13     19     183      279     20      37      8     22      14     22
-4          5     24      70      349      6      43      3     25       7     29
-3          2     26      11      360      7      50      1     26       3     32
-2          1     27       7      367      2      52      2     28       2     34
-1          3     30      41      408      2      54      5     33      14     48
-0          2    (32)     22     (430)     3     (57)     2    (35)      6    (54)

Sums:  ΣX1 = 405   ΣX1² = 7,076   ΣX1X2 = 915   ΣX1X3 = 387   ΣX1X4 = 595


The most rapid method of calculating extensions, where card-tabulating
equipment is available, is by a combination of the “digiting” method with cumulative
addition. Thus if four variables are being considered, the extensions for
$\sum X_1^2$, $\sum X_1X_2$, $\sum X_1X_3$, and $\sum X_1X_4$ would be secured by sorting on $X_1$. If each variable
were tabulated in two digits (0 to 99) the cards would first be classified in ten groups
from 00 to 90 on the tens column of $X_1$ and the total for each variable computed.
The cards would then be reclassified into ten groups from 0 to 9 according to the
values in the digit column, and the total for each variable computed. The totals
would then be entered as shown in Table 86, starting with the highest value and
running down to the smallest.





After the number of items and the sums for each group are entered (columns 1,
3, 5, 7, and 9) in the table the columns headed “accumulations” are computed as
follows: The first item, times 10, is entered in the top line. Thus the first item in
column 1 is 2, so 20 is entered in column 2. The second item in each column is
then multiplied by 10, added to the first item in the accumulation column, and the
total entered on the second line of the accumulation column. (Since the second item
in column 1 is 0, the second item in column 2 is 20 + (10)(0) = 20.) The third item
in each odd-numbered column is next multiplied by 10, added to the second item in
each adjoining accumulation column, and the total entered on the third line of each
accumulation column. (The third item in column 1 is 20, so the third item in column
2 is 20 + (10)(20) = 220.) The same operation is performed for the next line.
When this process is completed for all the classifications of $X_1$ by tens, it is begun
afresh for the classifications by digits, without multiplying by 10. The item in the
“-9” class, 3, of column 1, is entered in the adjoining accumulation column. The
next item, 2, is added to it, and the total, 5, entered in the accumulation column, and
so on down the column, entering in column 2 the accumulated total of column 1.
The same operation is performed in each of the other accumulation columns, each
showing the accumulated total for the column to its left.

In each case the accumulated total for the -0 group is one-tenth that for the
0-9 group, and is equal to the sum of the particular variable, checking all the
computations. That is because each value appears twice: once when the observations
are classified according to the first digit of $X_1$ (in the tens column), and once
when they are sorted according to the last digit of $X_1$ (in the units column). If
there were also a hundreds column, there would be a third sort for that, and the
accumulative totals for the hundreds groups, with 00 added at each step, would be
100 times as large as for the unit groups. After the entries in the -0 line have been
checked against those in the 0-9 line, both are enclosed in parentheses. All
the entries in each accumulation column are then added, except those enclosed in
parentheses. The totals for each column are then the extensions for $\sum X_1$, $\sum X_1^2$,
$\sum X_1X_2$, etc.¹

By the use of this method, each variable can be carried to 3, 4, or even more digits 
if desired, yet the extensions be obtained with exactly the same precision as if each 
individual item were extended separately. The work is very greatly reduced; the 
extensions, even if 3 digits were used for each variable, requiring at the most only 
30 lines, or for 4 digits, 40 lines. 
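
The digiting procedure lends itself to a short program. The following is a minimal sketch for non-negative two-digit integers; the function names and the sample values are hypothetical, chosen only to show that the accumulated totals reproduce the direct sum of products.

```python
def accumulate_by_digit(sort_values, other_values, place):
    """Sort on one digit of sort_values (place = 10 or 1) and accumulate.

    Returns the total of the accumulation column (lines 9 down to 1) and the
    final accumulated entry for the 0 line, which serves only as a check.
    """
    group_sums = [0] * 10
    for s, o in zip(sort_values, other_values):
        group_sums[(s // place) % 10] += o

    running = 0   # current entry in the accumulation column
    total = 0     # sum of the entries not enclosed in parentheses
    for digit in range(9, 0, -1):
        running += place * group_sums[digit]
        total += running
    check = running + place * group_sums[0]
    return total, check

def digited_product_sum(sort_values, other_values):
    """Sum of sort_values * other_values by digiting and accumulative totaling."""
    tens_total, tens_check = accumulate_by_digit(sort_values, other_values, 10)
    units_total, units_check = accumulate_by_digit(sort_values, other_values, 1)
    # The 0-9 line of the tens sort must be ten times the -0 line of the units sort.
    if tens_check != 10 * units_check:
        raise ValueError("check lines disagree")
    return tens_total + units_total

# Hypothetical two-digit observations.
x1 = [39, 30, 17, 15, 12, 8, 5, 0]
x2 = [2, 9, 4, 7, 1, 6, 3, 5]

print(digited_product_sum(x1, x2))             # product sum of X1 and X2
print(digited_product_sum(x1, x1))             # sum of squares of X1
print(digited_product_sum(x1, [1] * len(x1)))  # plain sum of X1
```

Passing a column of ones as the second series gives the sum of the sorted variable itself, which corresponds to accumulating the "number of items" column.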

¹ It is easily seen why this is so. Each item in the 30-39 group appears 3 times in
the accumulation column, times 10 each time; it, therefore, contributes 30 to the
total. Likewise, each item in the -9 column appears 9 times in the accumulative
totals from -9 to -1, so contributes 9 to that total. An item of $X_1 = 39$ is
represented by 1 in the 30-39 class in column 1, and by 39 in the same class in
column 3. It also appears as 1 in the -9 class in column 1, and as 39 in the same class in
column 3. Its contribution to $\sum X_1$ is then (3)(10) + (1)(9), or 39, and to $\sum(X_1^2)$ is
(39)(3)(10) + (39)(1)(9) = 1,521, or exactly equal to 39 and $(39)^2$. Similarly, an
item of $X_2 = 2$ when $X_1 = 39$, appears in column 5 in the 30-39 line, and in the
-9 line. It then appears 3 times in column 6, multiplied by 10 each time, and
9 times in column 6, multiplied by 1. Its contribution to $\sum(X_1X_2)$ is then
(2)(10)(3) + (2)(9) = 78, or exactly equal to (39)(2). It is now evident why the
entries in the 0 lines are not included in the total: they contribute nothing to the
product with $X_1$.




After the extensions with respect to $X_1$ had been made as just shown, the cards
would be reclassified with respect to $X_2$, and a similar tabulation and accumulative
totals prepared to obtain $\sum X_2$, $\sum X_2^2$, $\sum X_2X_3$, and $\sum X_2X_4$; and so on for the other
extensions required.

This method is of the greatest value where automatic card-tabulation equipment
is available, and a large number of observations are to be treated. Even for hand
operations, however, if the number of observations is very large it can be used to
advantage, with the individual items entered on cards or strips for handy classifying
and adding.

Use of the check sum. Where a number of different variables are involved, 
every operation in making the extensions, computing the averages and corrections, 
and solving the normal equations through to the “back solution,” can be verified by
an automatic check known as the “check sum.” The way in which the check sum 
is used will be illustrated by a small problem, carried through every step in turn, 
but it is equally applicable to any other method of tabulation and is especially 
valuable with machine tabulation, where it serves as an overall control on the 
accuracy of the machine processes. 

The check sum as a check in extending. The values in the following table
(Table 87) may be used to illustrate the use of the check sum.

The values in the columns $X_2$, $X_3$, $X_4$, and $X_1$ are the three independent factors
and the dependent factor, which are to be correlated. The values in the column
headed “$\sum X$” are the arithmetic totals of the values for the four other variables, and
are designated “the check sum.”

As the first step, each of these five columns is added. Since, for each line,

$$X_2 + X_3 + X_4 + X_1 = \sum X$$

it also holds true that

$$\sum X_2 + \sum X_3 + \sum X_4 + \sum X_1 = \sum(\sum X).$$

Adding the sums of the first four columns together gives the same value as the sum
of the check sum column, which verifies all the totals.

The first set of extensions is made by multiplying the items in each line by the
$X_2$ item in the first column of that line, giving the values shown under “Extensions
with $X_2$.” Since for each line

$$X_2 + X_3 + X_4 + X_1 = \sum X,$$

it also follows that

$$X_2^2 + X_2X_3 + X_2X_4 + X_2X_1 = X_2\sum X.$$

Then, adding each column, we find that the sums of the four other columns should
total to the same value as the sum of the $X_2\sum X$ column. Checking up, we see that
1,994 + 25,315 + 14,158 + 14,224 = 55,691, verifying all the calculations.

The other extensions are made in similar fashion, and the sums of each column
verified with the sum of the check-sum column, according to the relation

$$\sum X_2X_3 + \sum X_3^2 + \sum X_3X_4 + \sum X_3X_1 = \sum(X_3\sum X)$$

and the corresponding relation for the other extensions. It should be noted that, in
checking the “extensions with $X_3$,” the value $\sum X_2X_3$ is taken from the previous set
of extensions; in checking the “extensions with $X_4$,” the value for $\sum X_2X_4$ is taken
from the “extensions with $X_2$,” and the value for $\sum X_3X_4$ from the “extensions with
$X_3$”; and so on for the remaining checks.
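
The check-sum identity for the extensions can be verified mechanically. This sketch uses the thirteen observations of Table 87; the function name and the tuple layout are illustrative choices.

```python
# Observations of Table 87, each row given as (X2, X3, X4, X1).
rows = [
    (0, 136, 106, 103), (1, 140, 103, 108), (2, 86, 108, 102),
    (3, 115, 102, 111), (4, 115, 111, 95), (12, 161, 91, 109),
    (13, 235, 109, 118), (14, 304, 118, 123), (15, 224, 123, 108),
    (16, 185, 108, 100), (17, 108, 100, 88), (18, 193, 88, 109),
    (19, 175, 109, 103),
]

def extensions_with(rows, k):
    """Column sums of the extensions with variable k, plus the check-sum column.

    Returns a list of the four product sums (variable k times each variable)
    and the sum of variable k times the check sum of each line.
    """
    product_sums = [sum(row[k] * row[j] for row in rows) for j in range(4)]
    check_column = sum(row[k] * sum(row) for row in rows)
    return product_sums, check_column

for k, name in enumerate(["X2", "X3", "X4", "X1"]):
    product_sums, check_column = extensions_with(rows, k)
    # The four product sums must total to the check-sum column.
    print(name, product_sums, check_column, sum(product_sums) == check_column)
```

Unlike the hand routine, this computes every product afresh for each set of extensions, so nothing has to be carried over from an earlier set.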





TABLE 87

Calculation of Extensions, Using the Check Sum

            Variables                          Extensions with X2
  X2     X3     X4     X1     ΣX       X2²    X2X3    X2X4    X2X1    X2ΣX

   0    136    106    103    345         0       0       0       0       0
   1    140    103    108    352         1     140     103     108     352
   2     86    108    102    298         4     172     216     204     596
   3    115    102    111    331         9     345     306     333     993
   4    115    111     95    325        16     460     444     380   1,300
  12    161     91    109    373       144   1,932   1,092   1,308   4,476
  13    235    109    118    475       169   3,055   1,417   1,534   6,175
  14    304    118    123    559       196   4,256   1,652   1,722   7,826
  15    224    123    108    470       225   3,360   1,845   1,620   7,050
  16    185    108    100    409       256   2,960   1,728   1,600   6,544
  17    108    100     88    313       289   1,836   1,700   1,496   5,321
  18    193     88    109    408       324   3,474   1,584   1,962   7,344
  19    175    109    103    406       361   3,325   2,071   1,957   7,714

 134  2,177  1,376  1,377  5,064     1,994  25,315  14,158  14,224  55,691


        Extensions with X3                Extensions with X4        Extensions with X1
    X3²    X3X4    X3X1     X3ΣX       X4²    X4X1    X4ΣX           X1²     X1ΣX

 18,496  14,416  14,008   46,920    11,236  10,918  36,570        10,609   35,535
 19,600  14,420  15,120   49,280    10,609  11,124  36,256        11,664   38,016
  7,396   9,288   8,772   25,628    11,664  11,016  32,184        10,404   30,396
 13,225  11,730  12,765   38,065    10,404  11,322  33,762        12,321   36,741
 13,225  12,765  10,925   37,375    12,321  10,545  36,075         9,025   30,875
 25,921  14,651  17,549   60,053     8,281   9,919  33,943        11,881   40,657
 55,225  25,615  27,730  111,625    11,881  12,862  51,775        13,924   56,050
 92,416  35,872  37,392  169,936    13,924  14,514  65,962        15,129   68,757
 50,176  27,552  24,192  105,280    15,129  13,284  57,810        11,664   50,760
 34,225  19,980  18,500   75,665    11,664  10,800  44,172        10,000   40,900
 11,664  10,800   9,504   33,804    10,000   8,800  31,300         7,744   27,544
 37,249  16,984  21,037   78,744     7,744   9,592  35,904        11,881   44,472
 30,625  19,075  18,025   71,050    11,881  11,227  44,254        10,609   41,818

409,443 233,148 235,519  903,425   146,738 145,923 539,967       146,855  542,521







While the check sum would not disclose exactly compensating errors made in 
different columns the possibility of such errors is so remote that, after the arithmetic 
has been checked by the comparisons indicated, it may be assumed that no errors 
have been made either in making the multiplications or adding the columns. 

The check sum as a check in correcting to the means. After the values for $\sum X_2^2$,
$\sum X_2X_3$, etc., have all been computed as indicated in Table 87, the process of making
the corrections to get the values $\sum x_2^2$, $\sum x_2x_3$, etc., may be organized in regular fashion
and checked by the check sum, as shown in Table 88.

TABLE 88

Calculation of Product Sums Corrected to Departures from Means,
With Check Sum

                            X2          X3          X4          X1          ΣX    Line

Sums                      134.      2,177.      1,376.      1,377.      5,064.      1
Means                 10.30769   167.46154   105.84615   105.92308   389.53846      2

Extensions with X2    1,994.00   25,315.00   14,158.00   14,224.00   55,691.00      3
Corrections           1,381.23   22,439.84   14,183.38   14,193.69   52,198.14      4
Extensions with x2      612.77    2,875.16      -25.38       30.31    3,492.86      5

Extensions with X3              409,443.00  233,148.00  235,519.00  903,425.00      6
Corrections                     364,563.77  230,427.08  230,594.54  848,025.23      7
Extensions with x3               44,879.23    2,720.92    4,924.46   55,399.77      8

Extensions with X4                          146,738.00  145,923.00  539,967.00      9
Corrections                                 145,644.30  145,750.15  536,004.90     10
Extensions with x4                            1,093.70      172.85    3,962.10     11

Extensions with X1                                      146,855.00  542,521.00     12
Corrections                                             145,856.08  536,394.48     13
Extensions with x1                                          998.92    6,126.52     14


The first line in Table 88 gives the sums of each of the variables, including the
check sum. Dividing by the number of observations (13 in this case), gives the
mean for each variable, as entered in the second line. Again, the entries for the
first four columns total to equal the entry in the $\sum X$ column, checking the division.

The sums from the “extensions with $X_2$,” of Table 87, are next entered in line 3.
The sums for each variable, in line 1, are next multiplied by the mean of $X_2$ (10.30769),
and the products entered in the corresponding column in line 4. Subtracting the
entries in line 4 from those in line 3 gives the values which are entered in line 5.
These values are the extensions, expressed as departures from the means.

In column $X_3$, for example, the entry in line 3 is $\sum X_2X_3$; and the entry in line 4 is
$(\sum X_3)M_2$. The entry in line 5, then, is

$$\sum X_2X_3 - (\sum X_3)M_2 = \sum X_2X_3 - nM_2M_3 = \sum x_2x_3$$

Again, the values in the first four columns add to the same as the value in the
check-sum column, verifying the work.
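
Lines 3 to 5 of Table 88 can be reproduced directly from the column sums. This is a sketch under the assumption that the totals of Table 87 are taken as given; the dictionary keys and variable names are illustrative. Small differences in the last decimal place from the printed table arise because the table rounds the corrections before subtracting.

```python
n = 13

# Line 1 of Table 88: column sums, with the check sum kept separately.
sums = {"X2": 134, "X3": 2177, "X4": 1376, "X1": 1377}
check_sum_total = 5064

# Line 3: sums of the extensions with X2 (from Table 87).
ext_with_x2 = {"X2": 1994, "X3": 25315, "X4": 14158, "X1": 14224}
check_ext_with_x2 = 55691

mean_x2 = sums["X2"] / n

# Line 4 (corrections) and line 5 (product sums about the means).
corrections = {name: total * mean_x2 for name, total in sums.items()}
corrected = {name: ext_with_x2[name] - corrections[name] for name in sums}

# The same operations applied to the check-sum column.
corrected_check = check_ext_with_x2 - check_sum_total * mean_x2

for name in sums:
    print(name, round(corrections[name], 2), round(corrected[name], 2))
print("check", round(corrected_check, 2), round(sum(corrected.values()), 2))
```

The final line prints the corrected check sum beside the total of the four corrected columns; the two should agree, which is the verification described above.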





The rest of the table is entered in similar fashion. Lines 6, 9, and 12 are the
extensions with $X_3$, $X_4$, and $X_1$, from Table 87. Lines 7, 10, and 13 are the values
in the corresponding columns of line 1, multiplied by $M_3$, $M_4$, and $M_1$, respectively
(from line 2). Lines 8, 11, and 14, obtained by subtracting the items in lines 7, 10,
and 13 from those in 6, 9, and 12, show the values corrected for departures from
the means.

In verifying the sum of the other entries in line 8 by the check sum, the item
$\sum x_2x_3$ must be included, from column $X_3$, line 5, before comparing with the check
sum; in checking line 11, $\sum x_2x_4$ and $\sum x_3x_4$, from column $X_4$, lines 5 and 8, must be
included; and in checking line 14, the values $\sum x_1x_2$, $\sum x_1x_3$, and $\sum x_1x_4$, from column
$X_1$, lines 5, 8, and 11, must all be included. For line 11, the other items add to 3,962.11,
as against the check sum of 3,962.10; and for line 14, they add to 6,126.55, as against
the check sum of 6,126.52. In both cases the discrepancies are so small as to be
readily due to raising and lowering in the last digit, and, therefore, may be
disregarded.

TABLE 89

Solution of Normal Equations by the Doolittle Method,
With Check Sum

Line                       X2           X3           X4          X1           ΣX

I                      612.77     2,875.16       -25.38       30.31     3,492.86
I'                   -1.00000     -4.69207      0.04142    -0.04946     -5.70011

II                 (2,875.16)    44,879.23     2,720.92    4,924.46    55,399.77
(-4.69207)(I)     (-2,875.16)   -13,490.45       119.08     -142.22   -16,388.74
Σ2                               31,388.78     2,840.00    4,782.24    39,011.03
II'                              -1.00000     -0.09048    -0.15236     -1.24283

III                  (-25.38)   (2,720.92)     1,093.70      172.85     3,962.10
(0.04142)(I)          (25.38)     (119.08)        -1.05        1.26       144.67
(-0.09048)(Σ2)                 (-2,840.00)      -256.96     -432.70    -3,529.72
Σ3                                               835.69     -258.59       577.05
III'                                           -1.00000     0.30944     -0.69056

BACK SOLUTION

                       b12.34       b13.24       b14.23

                      0.04946      0.15236     -0.30944
                     -0.01282      0.02800
                     -0.84626
                     --------     --------     --------
                     -0.80962      0.18036     -0.30944






Lines 5, 8, 11, and 14 now give the values required to determine the regression 
coefficients by simultaneous solution, according to equations (38). 

The check sum as a check in solving the normal equations. The solution of the 
simultaneous equations by the Doolittle method has already been illustrated in 
Chapter 12, page 200. The check sum may be used to verify each step in the com- 
putation, as shown in Table 89. 



SOLVING NORMAL EQUATIONS


The values from line 5 of Table 88, including the check sum, are entered as line I
of Table 89. Each item is divided by the first item of the line, with its sign changed
(-612.77). The quotients are entered as line I'. The sum of the first four items
checks to the value in the last column, the check sum.

The values from line 8 of Table 88 are entered as line II of Table 89, beginning
with column X3 (the values enclosed in parentheses will be explained later). Line I
is next multiplied by the value in column X3 of line I' (-4.69207), and the products
entered in the corresponding columns below line II. These two lines are then summed,
giving line Σ2. These operations are now verified by adding the items of line Σ2 in
columns X3, X4, and X1, and comparing the sum with the check sum in column ΣX.
The three values add to 39,011.02, agreeing to 0.01 with the check sum, 39,011.03.

The values in line Σ2 are next divided by the value in column X3, with its sign
changed (-31,388.78). The quotients are entered as line II'. Again the check
sum verifies the computation.

The values from line 11, Table 88, are then entered as line III, beginning with
column X4. (Again disregard the figures in parentheses.) Line I is multiplied by
the value in column X4 of line I' (0.04142), and the products entered in the
corresponding columns below line III; and line Σ2 is multiplied by the value in column X4
of line II', and the products entered in the corresponding columns in the next line.
Line III and the two following lines are then summed, giving line Σ3. The values in
line Σ3 are divided by the value in column X4 of that line, with its sign changed. The
quotients are entered as line III'. Again the check sum verifies the work. The
values in line Σ3 (before the check sum) add to 577.10, which agrees to 0.05 with the
check sum of 577.05.
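The front solution just described can be sketched in Python as follows. This is a minimal illustration, not the author's worksheet: the function name `front_solution` is an assumption, the rows are written out in full (including the items that Table 89 shows in parentheses), and the arithmetic is carried at full machine precision, so the last digits may differ slightly from the hand-rounded table.

```python
def front_solution(rows):
    """Doolittle front solution.

    rows[i] is equation i written in full: coefficients, right-hand side
    (column X1), and check sum. Returns the sigma lines and the primed lines.
    """
    sigma, primed = [], []
    for i, row in enumerate(rows):
        line = list(row)
        # Add each earlier sigma line, multiplied by the value that its
        # primed line carries in the column of the current variable.
        for j in range(i):
            mult = primed[j][i]
            line = [v + mult * s for v, s in zip(line, sigma[j])]
        sigma.append(line)
        # Divide by the leading item with its sign changed.
        primed.append([-v / line[i] for v in line])
    return sigma, primed

# Columns: X2, X3, X4, X1, check sum (values as given for Table 89).
rows = [
    [612.77, 2875.16, -25.38, 30.31, 3492.86],
    [2875.16, 44879.23, 2720.92, 4924.46, 55399.77],
    [-25.38, 2720.92, 1093.70, 172.85, 3962.10],
]
sigma, primed = front_solution(rows)
# Check-sum discrepancy of every sigma line (should be a rounding-size amount).
discrepancies = [abs(sum(line[:-1]) - line[-1]) for line in sigma]
```

Here `sigma[2]` corresponds to line Σ3 and `primed[2]` to line III'; the entries in `discrepancies` play the role of the small differences noted in the text.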

The values in lines I', II', and III' of column X1, with the signs changed, are then
entered at the foot of columns X2, X3, and X4 (designated here b12.34, b13.24, and
b14.23). The value at the foot of the X4 column, -0.30944, is the value for b14.23.
The item in column X4, line I' (0.04142), is then multiplied by the last of these
values (-0.30944), and the product (-0.01282) entered in the X2 column; and the
item in column X4, line II' (-0.09048), is also multiplied by -0.30944, and the
product entered in the X3 column. The two entries at the foot of the X3 column
are then added, giving 0.18036 as the value for b13.24. The item in column X3, line I'
(-4.69207), is then multiplied by 0.18036, and the product (-0.84626) entered
below the other two entries at the foot of the X2 column. The sum of these three
entries, -0.80962, is then the value for b12.34.

The way the check sum works in checking the operations may be seen by filling
in the missing spaces in Table 89, as indicated by the entries enclosed in parentheses.
Thus in line II, the first item, 2,875.16, is the same item, $\Sigma x_2 x_3$, as appears in line I,
column X3. If, when line I had been multiplied by -4.69207, the operation had
included the X2 column also, the product would have been -2,875.16, or exactly the
same as the item in line I, column X3, with the sign changed. This value, entered
below line II in column X2, exactly cancels the previous value when the two lines
are added, leaving line Σ2 still the same.

Similarly, the values -25.38 and 2,720.92, from lines I and II of column X4, may
be entered in parentheses, in columns X2 and X3 of line III. If the previous
operations had been carried out in full, below them would appear 25.38 in column X2
(column X4, line I, times -1), and 119.08 and -2,840.00 ([column X4, line I] times
[-4.69207] and column X4, line Σ2, times -1). When the three lines are totaled
to give line Σ3, the items exactly cancel out, as before.





It should be noted that when all the items are entered in each line, including
those in parentheses, the sum of the items in columns X2 to X1 exactly equals, line
by line, the item in column ΣX. For that reason, if any error is found when one of
the Σ lines is reached, the line in which the error occurred can be determined by
adding the items line by line, and verifying the totals against the individual check
sums. To do this it is not necessary to enter the missing items, as has been done
in Table 89 (in parentheses); instead, the items left out can be picked out by going
up the columns for the particular variable concerned. Thus all the missing terms
for line III (extensions for X4) and the next two lines appear above in the X4 column.
Once the location of the missing items in the previous work has been learned, they
can be used to verify the computations line by line, and any error readily located.

The "back solution" is simply the solution, in regular form, of lines III', II', and
I' for b4, b3, and b2. Thus line III', if written out, is

$$-b_4 = 0.30944$$

Hence $b_4 = -0.30944$, the value at the foot of column X4. Similarly, line II',
written out, becomes

$$-b_3 - 0.09048 b_4 = -0.15236$$

Substituting the above value for b4, and rearranging,

$$b_3 = 0.15236 - (0.09048)(-0.30944) = 0.15236 + 0.02800$$

These last two values are the same as shown at the foot of column X3, hence
$b_3 = 0.18036$.

Similarly line I', when written out in full, is

$$-b_2 - 4.69207 b_3 + 0.04142 b_4 = -0.04946$$

Substituting values for b3 and b4, and rearranging,

$$b_2 = 0.04946 + (0.04142)(-0.30944) + (-4.69207)(0.18036)$$

$$= 0.04946 - 0.01282 - 0.84626 = -0.80962$$

exactly as shown at the foot of column X2.
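The back solution written out above can be sketched as a short Python routine. The function name `back_solution` and the list layout are assumptions made for this illustration; the numbers are the rounded entries of lines I', II', and III' in columns X2, X3, X4, and X1.

```python
def back_solution(primed):
    """Solve the primed lines for the b's, working from the last line upward.

    primed[i] holds line i' in the columns of the unknowns, followed by the
    X1 column. Each line reads: -b_i + sum(p_ij * b_j for j > i) = p_i,X1.
    """
    k = len(primed)
    b = [0.0] * k
    for i in reversed(range(k)):
        # Change the sign of the X1 entry, then add the products of the
        # later columns with the b's already found.
        b[i] = -primed[i][k] + sum(primed[i][j] * b[j] for j in range(i + 1, k))
    return b

primed = [
    [-1.0, -4.69207, 0.04142, -0.04946],   # line I'
    [0.0, -1.0, -0.09048, -0.15236],       # line II'
    [0.0, 0.0, -1.0, 0.30944],             # line III'
]
b2, b3, b4 = back_solution(primed)
```

With these inputs the routine reproduces the three coefficients found at the foot of Table 89, to within rounding in the fifth decimal place.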

Having computed the values of the three regression coefficients, the final steps
are (a) to check those values by substituting them in the last equation (line III, in
full); (b) to compute the coefficient of multiple correlation; and (c) to compute the
constant $a_{1.234}$ for the regression equation. These steps are all shown in Table 90.

The first operation in Table 90 is the final checking of the entire solution,
including the back solution. This is done by substituting the values found for the b's in the
last equation of the normal equations. For this problem, that equation is:

$$\Sigma(x_2 x_4) b_2 + \Sigma(x_3 x_4) b_3 + \Sigma(x_4^2) b_4 = \Sigma x_1 x_4$$

The values of the three b's are entered in column 1 of the table, and the values of the
corresponding coefficients of the unknowns, such as $\Sigma(x_2 x_4)$, etc., are entered in
column 2. The product of each b with its coefficient is then computed, and entered
in column 3. These add to 172.87, checking satisfactorily with the value of $\Sigma(x_1 x_4)$,
172.85, as shown at the foot of column 2.

The computation of the coefficient of multiple correlation, according to equation
(46):

$$R^2_{1.234} = \frac{b_2(\Sigma x_1 x_2) + b_3(\Sigma x_1 x_3) + b_4(\Sigma x_1 x_4)}{\Sigma(x_1^2)}$$

is shown in tabular form in columns 4 and 5.


TABLE 90

Final Steps in Solution of Multiple Correlation Problem

Variable   Regression    Equation     Check     Equation   Computation    Means    Computation
           coefficient     III                     X1        of R²                    of a
              (1)          (2)         (3)        (4)         (5)          (6)         (7)
X2          -0.80962      -25.38      20.55      30.31      -24.54       10.308      -8.35
X3           0.18036    2,720.92     490.75   4,924.46      888.18      167.462      30.20
X4          -0.30944    1,093.70    -338.43     172.85      -53.49      105.846     -32.75
Sums                      172.85     172.87     998.92      810.15      105.92      -10.90




The values $\Sigma(x_1 x_2)$, etc., as shown in Table 88, lines 5, 8, and 11 of column X1,
and $\Sigma(x_1^2)$, shown in line 14, are entered in column 4 of Table 90. Each product sum
is multiplied by the corresponding b, shown in column 1, and the products entered
in column 5. The sum of these products is then the numerator of the fraction in
equation (46). The computation is then readily completed:


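$$R^2_{1.234} = \frac{810.15}{998.92} = 0.8110$$

$R_{1.234} = 0.9006$. With $n = 13$ and $m = 4$,

$$\bar{R}^2_{1.234} = 1 - (1 - 0.811)\frac{12}{9} = 0.748, \quad \text{and} \quad \bar{R}_{1.234} = 0.86$$

The standard error of estimate may also be readily computed:

$$\Sigma(x_1^2) = n\sigma_1^2 = 998.92$$

$$\Sigma[(x_1')^2] = n\sigma_{x'}^2 = 810.15$$

then since $n\sigma_1^2 - n\sigma_{x'}^2 = n\sigma_z^2$,

$$\Sigma z^2 = n\sigma_z^2 = 188.77$$

Since there are 13 cases and 3 independent variables,

$$\bar{S}^2_{1.234} = \frac{n\sigma_z^2}{n - m} = \frac{188.77}{9} = 20.97$$

and

$$\bar{S}_{1.234} = 4.58$$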
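The computations of this step can be retraced with a short Python sketch. The variable names are assumptions chosen for this illustration; the inputs are the regression coefficients and product sums shown in Table 90, with n = 13 observations and m = 4 constants.

```python
from math import sqrt

b = [-0.80962, 0.18036, -0.30944]     # b12.34, b13.24, b14.23
sum_x1_xj = [30.31, 4924.46, 172.85]  # product sums of x1 with x2, x3, x4
sum_x1_sq = 998.92                    # sum of squared departures of X1
n, m = 13, 4                          # observations; constants in the equation

# Numerator of equation (46): sum of each b times its product sum.
explained = sum(bi * s for bi, s in zip(b, sum_x1_xj))
r_squared = explained / sum_x1_sq

# Adjustment for the number of constants fitted.
r_squared_adj = 1 - (1 - r_squared) * (n - 1) / (n - m)

# Residual sum of squares and adjusted standard error of estimate.
residual = sum_x1_sq - explained
s_adj = sqrt(residual / (n - m))
```

Run as written, `explained` comes to about 810.15, `r_squared` to about 0.811, `r_squared_adj` to about 0.748, and `s_adj` to about 4.58, matching the hand computation.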






The a for the regression equation is next computed. Using equation (39),

$$a_{1.234} = M_1 - b_2 M_2 - b_3 M_3 - b_4 M_4$$

we may arrange the work in tabular order as shown in columns 6 and 7 of Table 90.
The means, from line 2 of Table 88, are entered in column 6, then multiplied by their
respective b's, and the products entered in column 7. To complete the computation,
following equation (39), the sum of this column is then subtracted from the mean
of X1:

$$a_{1.234} = 105.92 - (-10.90) = 116.82$$
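As a small check on this step, the constant can be recomputed in Python from the means and regression coefficients of Table 90. The variable names are assumptions for this sketch only.

```python
b = [-0.80962, 0.18036, -0.30944]   # b12.34, b13.24, b14.23
means = [10.308, 167.462, 105.846]  # M2, M3, M4
mean_x1 = 105.92                    # M1

# Column 7 of Table 90: each mean times its regression coefficient.
products = [bi * mi for bi, mi in zip(b, means)]

# Equation (39): subtract the sum of the products from the mean of X1.
a = mean_x1 - sum(products)
```

The sum of `products` is about -10.90, so `a` comes to about 116.82.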

This completes the computation of all the linear multiple correlation constants.
The results may be summarized:

$$X_1' = 116.82 - 0.810 X_2 + 0.180 X_3 - 0.309 X_4$$

$$\bar{R}_{1.234} = 0.86$$

$$\bar{S}_{1.234} = 4.58$$

Tables 87, 88, 89, 90, and the computations following 90, have shown every
arithmetic step in obtaining these results, arranged in the most convenient form for
ready computation and checking. For problems involving large numbers of
observations, the methods of computing the extensions, such as $\Sigma X_1^2$ and $\Sigma X_1 X_2$, which are
shown in Tables 85 and 86, may be used in place of the individual-item method
shown in Table 87; but, thereafter, the work is the same. The check sum may be
carried through in computations like those shown in Table 86 just as readily as in
the longer method, thus providing a complete check on all the tabulating,
multiplying, and adding.

In solving the equations in actual practice, only the items that are not enclosed
in parentheses in Table 89 would be entered. Table 91 illustrates this same form
of solution for a six-variable problem. It is carried out step by step, just as was
Table 89. A slightly different notation is used, but the procedure through line III'
is the same. Then following line IV, line IV-1 is obtained by multiplying line I
by -0.256900, the coefficient in line I', column X5; line IV-2 by multiplying line Σ2
by -0.414640, the value in line II', column X5; and line IV-3 by multiplying line Σ3
by -0.168107, the value in line III', column X5. A similar regular order is followed
at the next step of the process, multiplying lines I, Σ2, Σ3, and Σ4 by the coefficients
in column X6, lines I', II', III', and IV'. Table 91 also illustrates the calculation
of the back solution, the final check on the values of the b's by substituting in the
last equation (equation V), and the computation of R.
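The same regular order extends to any number of variables, which can be sketched as one general Python routine. The function name `doolittle_solve` is an assumption for this illustration, and the coefficients are the five equations of Table 91 written out in full (each equation includes the symmetric items that the table leaves in parentheses). Full machine precision is used, so the last digits may differ slightly from the table.

```python
def doolittle_solve(coeffs, rhs):
    """Solve symmetric normal equations by a front solution and a back solution."""
    k = len(rhs)
    sigma, primed = [], []
    for i in range(k):
        line = list(coeffs[i]) + [rhs[i]]
        # Lines i-1, i-2, ...: earlier sigma lines times the primed value
        # found in the column of the current variable.
        for j in range(i):
            mult = primed[j][i]
            line = [v + mult * s for v, s in zip(line, sigma[j])]
        sigma.append(line)
        primed.append([-v / line[i] for v in line])
    # Back solution, starting from the last primed line.
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = -primed[i][k] + sum(primed[i][j] * b[j] for j in range(i + 1, k))
    return b

# Columns X2 to X6; right-hand side is the X1 column.
coeffs = [
    [100.00, 23.32, 19.86, 25.69, 10.64],
    [23.32, 100.00, 17.47, 45.20, 21.39],
    [19.86, 17.47, 100.00, 26.28, 0.33],
    [25.69, 45.20, 26.28, 100.00, 29.89],
    [10.64, 21.39, 0.33, 29.89, 100.00],
]
rhs = [40.17, 60.03, 23.79, 68.07, 35.53]

b = doolittle_solve(coeffs, rhs)                      # b12.3456 ... b16.2345
r_squared = sum(bi * ri for bi, ri in zip(b, rhs)) / 100.00
```

The list `b` holds the coefficients in the order X2 to X6, and `r_squared` follows the same pattern as equation (46), with 100.00 standing in the denominator as in Table 91.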

In entering equations V and X1, in the second and fourth columns from the right
in the last section of Table 91, we note that the sequence of the values, from the top
to the bottom of the columns, is reversed from the sequence at the head of the table,
from left to right for equation V, and from top to bottom for equation X1. This is
because the form of the back solution at the foot of the table places b16.2345 at the top
of the regression coefficients and b12.3456 at the foot. Hence in entering equations V
and X1, they must start off with $\Sigma(x_6^2)$ and $\Sigma(x_1 x_6)$ and end up with $\Sigma(x_2 x_6)$ and
$\Sigma(x_1 x_2)$. Reversing the order, as shown, produces this result.

When automatic calculating machines are available to perform the calculations 
shown in Tables 89 and 91, many of the operations shown can be performed in the 
machine without separate recording in the tables. Details of this short cut are 
given on page 478. 



STANDARD ERRORS OF PARTIAL REGRESSION COEFFICIENTS 469 


TABLE 91

Doolittle Solution of Normal Equations for Six Variables

Equations to be Solved

Line           X2         X3         X4         X5         X6         X1         ΣX
Eq. I        100.00      23.32      19.86      25.69      10.64      40.17     219.68
Eq. II       (23.32)    100.00      17.47      45.20      21.39      60.03     267.41
Eq. III      (19.86)    (17.47)    100.00      26.28       0.33      23.79     187.73
Eq. IV       (25.69)    (45.20)    (26.28)    100.00      29.89      68.07     295.13
Eq. V        (10.64)    (21.39)     (0.33)    (29.89)    100.00      35.53     197.78

Front Solution

Line           X2          X3          X4          X5          X6          X1          ΣX
I           100.0000     23.3200     19.8600     25.6900     10.6400     40.1700    219.6800
I'         -1.000000   -0.233200   -0.198600   -0.256900   -0.106400   -0.401700   -2.196800
II                      100.0000     17.4700     45.2000     21.3900     60.0300    267.4100
II-1                     -5.4382     -4.6314     -5.9909     -2.4812     -9.3676    -51.2294
Σ2                       94.5618     12.8386     39.2091     18.9088     50.6624    216.1806
II'                    -1.000000   -0.135769   -0.414640   -0.199962   -0.535760   -2.286130
III                                 100.0000     26.2800      0.3300     23.7900    187.7300
III-1                                -3.9442     -5.1020     -2.1131     -7.9778    -43.6284
III-2                                -1.7431     -5.3234     -2.5672     -6.8784    -29.3506
Σ3                                   94.3127     15.8546     -4.3503      8.9338    114.7510
III'                               -1.000000   -0.168107   +0.046126   -0.094725   -1.216706
IV                                              100.0000     29.8900     68.0700    295.1300
IV-1                                             -6.5998     -2.7334    -10.3197    -56.4358
IV-2                                            -16.2577     -7.8403    -21.0067    -89.6371
IV-3                                             -2.6653     +0.7313     -1.5018    -19.2904
Σ4                                               74.4772     20.0476     35.2418    129.7667
IV'                                            -1.000000   -0.269178   -0.473189   -1.742367
V                                                           100.0000     35.5300    197.7800
V-1                                                          -1.1321     -4.2741    -23.3740
V-2                                                          -3.7810    -10.1306    -43.2279
V-3                                                          -0.2007     +0.4121     +5.2930
V-4                                                          -5.3964     -9.4863    -34.9303
Σ5                                                           89.4898     12.0511    101.5408
V'                                                         -1.000000   -0.134664   -1.134664

Back Solution

               X2          X3          X4          X5          X6        Eq. V     Check     Eq. X1   Computation
                                                                                                         of R²
            +0.401700   +0.535760   +0.094725   +0.473189   +0.134664
b16.2345    -0.014328   -0.026928   +0.006212   -0.036249   +0.134664   100.00    13.4664    35.53      4.7846
b15.2346    -0.112250   -0.181173   -0.073453   +0.436940                29.89    13.0601    68.07     29.7425
b14.2356    -0.005458   -0.003731   +0.027484                             0.33     0.0091    23.79      0.6538
b13.2456    -0.075540   +0.323927                                        21.39     6.9288    60.03     19.4453
b12.3456    +0.194124                                                    10.64     2.0655    40.17      7.7980
Sums                                                                     35.53    35.5299   100.00     62.4242

$$R^2 = \frac{62.4242}{100.00} = 0.624242$$

$$R = \sqrt{0.624242} = 0.790090$$


Standard errors of partial regression coefficients and standard error of an
individual estimate. The computation of standard errors of net or partial regression
coefficients by equation (74), as discussed in Chapter 18, and of standard errors of
an individual estimate, by equations (77) or (81), as described in Chapter 19, may
be much simplified by the following procedure:

For three independent variables, set up the normal equations:

$$\Sigma(x_2^2) c_{22} + \Sigma(x_2 x_3) c_{23} + \Sigma(x_2 x_4) c_{24} = 1$$

$$\Sigma(x_2 x_3) c_{22} + \Sigma(x_3^2) c_{23} + \Sigma(x_3 x_4) c_{24} = 0$$

$$\Sigma(x_2 x_4) c_{22} + \Sigma(x_3 x_4) c_{23} + \Sigma(x_4^2) c_{24} = 0$$

Solve simultaneously to obtain the values for c22, c23, and c24. Then set up
exactly the same set of equations, with c32, c33, and c34 as the unknowns, and with
0, 1, and 0 to the right of the equal signs, in the first, second, and third equations,
respectively, and solve. Then set up again, with c42, c43, and c44 as the unknowns, and
with 0, 0, and 1 to the right of the equal signs, and solve again. The standard
errors of the regression coefficients may then be found by the following equations
(for proof, see Note 13, Appendix 2):

$$\sigma_{b_{12.34}} = \bar{S}_{1.234}\sqrt{c_{22}}$$

$$\sigma_{b_{13.24}} = \bar{S}_{1.234}\sqrt{c_{33}} \qquad (101)$$

$$\sigma_{b_{14.23}} = \bar{S}_{1.234}\sqrt{c_{44}}$$

It will be noted that, except for the values to the right of the equal sign, the
coefficients of the equations are exactly the same as those required to obtain the
values of b12.34, b13.24, and b14.23. For that reason the values for c22, c33, and c44 may be
most readily calculated by introducing as many new columns in the form of the
Doolittle solution (Table 91) as there are independent factors, between the columns
for X1 and Σ. These columns will be


Line          Error b2    Error b3    Error b4    Error b5
(Eq. I)           1           0           0           0
(Eq. II)          0           1           0           0
(Eq. III)         0           0           1           0
(Eq. IV)          0           0           0           1
etc.


These values can be included in the check sum, and the operations carried through
for them just as for the other columns until the "back solution" to find the b's is
reached. Then a separate "back solution" can be run for each set of "c" values,
starting with the values in each "Error" column just as the back solution to find
the b's started with the values in the X1 column.^


^ For an explanation of why this process and equation (101) gives the standard
error of the b's see Note 14, Appendix 2. For other uses of the "c" constants, see
R. A. Fisher, Statistical Methods for Research Workers, seventh edition, pages 160-168,
Oliver and Boyd, Edinburgh and London, 1938.






TABLE 92

Solution of Normal Equations by the Doolittle Method, to Calculate
Regression Coefficients and Their Standard Errors

Equations To Be Solved

Line             X2           X3           X4           X1        c2     c3     c4     Σ(X + c)
Eq. I          612.77      2,875.16      -25.38        30.31       1      0      0     3,493.86
Eq. II       2,875.16     44,879.23    2,720.92     4,924.46       0      1      0    55,400.77
Eq. III        -25.38      2,720.92    1,093.70       172.85       0      0      1     3,963.09

Front Solution

Line                  X2            X3            X4            X1            c2            c3            c4         Σ(X + c)
I                   612.77      2,875.16        -25.38         30.31       1.00000          0             0          3,493.86
I'              -1.0000000    -4.6920704     0.0414185    -0.0494639    -0.0016319          0             0        -5.7017477
II                             44,879.23      2,720.92      4,924.46          0          1.00000          0         55,400.77
I(-4.6920704)                 -13,490.45        119.08       -142.22    -4.6920704          0             0        -16,393.44
Σ2                             31,388.78      2,840.00      4,782.24    -4.6920704       1.00000          0         39,007.33
II'                           -1.0000000    -0.0904782    -0.1523551     0.0001495    -0.0000319          0        -1.2427157
III                                           1,093.70        172.85          0             0          1.00000       3,963.09
I(0.0414185)                                     -1.05          1.26     0.0414185          0             0            144.71
Σ2(-0.0904782)                                 -256.96       -432.69     0.4245300    -0.0904782          0         -3,529.31
Σ3                                              835.69       -258.58     0.4659485    -0.0904782     1.0000000         578.49
III'                                        -1.0000000     0.3094210    -0.0005576     0.0001083    -0.0011966     -0.6922304

Back Solution on c2

                 c22           c23           c24
              0.0016319    -0.0001495     0.0005576
              0.0000231    -0.0000505
              0.0009384
Total         0.0025934    -0.0002000     0.0005576

Check, Eq. II-c2
                Value       Eq. II       Check
c24           0.0005576     2,720.92       1.52
c23          -0.0002000    44,879.23      -8.98
c22           0.0025934     2,875.16       7.46
Sum                                        0.00

Back Solution on c3

                 c32           c33           c34
                            0.0000319    -0.0001083
                            0.0000098
Total        -0.0002000     0.0000417    -0.0001083

Check, Eq. II-c3
                Value       Eq. II       Check
c34          -0.0001083     2,720.92      -0.29
c33           0.0000417    44,879.23       1.87
c32          -0.0002000     2,875.16      -0.58
Sum                                        1.00

Back Solution on c4

                 c42           c43           c44
Total         0.0005576    -0.0001083     0.0011966

Check, Eq. II-c4
                Value       Eq. II       Check
c44           0.0011966     2,720.92       3.26
c43          -0.0001083    44,879.23      -4.86
c42           0.0005576     2,875.16       1.60
Sum                                        0.00







Table 92 shows all the computations necessary to compute all the b's and c's from
the product sums calculated in Table 88, except for the back solution on X1, as shown
in the lower section of Table 89. Table 92 thus replaces all of Table 89, except this
last section. In practice, this back solution would be included in Table 92 ahead
of the three back solutions on c2, c3, and c4.

In computing Table 92, the work in the c columns is carried out to two more
decimal places than in the other columns. This is necessary because of the small
size of the values involved. It should also be noticed that in the back solution on
c3, only c34 and c33 are calculated directly. Since c32 is identical with c23, the value
previously calculated for the latter is inserted instead. Similarly, the back solution
on c4 involves no additional calculating at all, since c44 is copied down (with the sign
changed) from line III', c24 is written down for c42, and c34 for c43. Only the
computation by substitution in the check equations is involved. Even that computation
can be omitted for the c4 values, since each of them has been checked earlier: c42 and
c43 by substitution and c44 by the check sum in lines Σ3 and III'.

As a result of these computations, the following values are secured:

$$c_{22} = 0.00259; \quad c_{33} = 0.000042; \quad c_{44} = 0.00120$$

Since $\bar{S}_{1.234} = 4.58$, the standard error of the b's may be readily calculated by
equation (101):

$$\sigma_{b_{12.34}} = 4.58\sqrt{0.00259} = 0.233$$

$$\sigma_{b_{13.24}} = 4.58\sqrt{0.000042} = 0.030$$

$$\sigma_{b_{14.23}} = 4.58\sqrt{0.00120} = 0.159$$

The net regression coefficients may then be stated

$$b_{12.34} = -0.810 \pm 0.233$$

$$b_{13.24} = 0.180 \pm 0.030$$

$$b_{14.23} = -0.309 \pm 0.159$$
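The c values and the standard errors of equation (101) can be retraced with the following Python sketch. It does not reproduce the tabular layout of Table 92: it simply solves the same normal equations three times, with 1, 0, 0, then 0, 1, 0, then 0, 0, 1 on the right-hand side. The helper name `solve` is an assumption for this illustration, and plain elimination without pivoting is used, as in the hand method.

```python
from math import sqrt

def solve(a, rhs):
    """Solve a small linear system by elimination and back-substitution."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for i in range(n):
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            m[r] = [v - f * p for v, p in zip(m[r], m[i])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

# Product sums of x2, x3, x4 (coefficients of the normal equations).
a = [
    [612.77, 2875.16, -25.38],
    [2875.16, 44879.23, 2720.92],
    [-25.38, 2720.92, 1093.70],
]
s_adj = 4.58  # adjusted standard error of estimate

# One solution per "Error" column; c[j] holds c_j2, c_j3, c_j4.
units = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
c = [solve(a, e) for e in units]

# Equation (101): standard error of each b from the diagonal c values.
std_errors = [s_adj * sqrt(c[j][j]) for j in range(3)]
```

The diagonal entries `c[0][0]`, `c[1][1]`, and `c[2][2]` correspond to c22, c33, and c44; small differences from the printed values in the last decimal place reflect the rounding of the hand computation.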


Just as in the illustrations discussed in Chapter 18, some of the net regression
coefficients are much more reliable than are others. If we assume that the conditions
of random sampling are fulfilled, there is some possibility that the regression for b14.23
in the universe from which the sample was drawn is really positive instead of negative;
but there is only a very slight chance that b12.34 is really positive, and it is almost a
certainty that b13.24 is really positive, and above 0.1.

The computation of the standard errors of the net regression coefficients, by
the method just presented, is not a difficult one. It should be made an integral part
of every multiple correlation solution, so that not only will the regression coefficients
be obtained, but also the amount of confidence that can be placed in each value will be
determined. Only if that is done can the regressions be interpreted with confidence.

The computations shown in Table 92 also give all the values needed to estimate
the standard error of an individual estimate. Substituting these values in equation
(77), and using the value for $\bar{S}_{1.234}$ previously calculated on page 467 (in practice,
the calculations on that page would all be made after Table 92 was calculated,
including the back solution on X1), we have:

$$\sigma^2_{X_1'} = 4.58\left[1 + \frac{1}{13} + .00259x_2^2 + .000042x_3^2 + .00120x_4^2 + 2(-.00020)x_2x_3 + 2(.00056)x_2x_4 + 2(-.00011)x_3x_4\right] \qquad (X)$$

The use of this equation may be shown as follows: Suppose we draw a new
observation from the same universe as that from which the original sample (shown
in Table 87) was drawn, and the new observation has values of 18 for X2, 300 for X3,
and 90 for X4. After we estimate the probable X1 value from the regression equation,
how much confidence can we place in that estimate?

The estimated value works out as follows:

The regression equation (from Table 90) is

$$X_1' = -10.90 - 0.80962X_2 + 0.18036X_3 - 0.30944X_4$$

$$= -10.90 - 0.80962(18) + 0.18036(300) - 0.30944(90)$$

$$= 0.79$$

Before the values of X2, X3, and X4 for this new observation can be substituted in
equation (X), they must be put in the form of x2, x3, and x4. Using the means shown in
Table 86, we calculate

$$x_2 = X_2 - M_2 = 18 - 10.31 = 7.69$$

$$x_3 = X_3 - M_3 = 300 - 167.46 = 132.64$$

$$x_4 = X_4 - M_4 = 90 - 105.84 = -15.84$$

Substituting these values in equation (X), we have

$$\sigma^2_{X_1'} = 4.58\left[1 + \frac{1}{13} + .00259(7.69)^2 + .000042(132.64)^2 + .00120(-15.84)^2 + 2(-.00020)(7.69)(132.64) + 2(.00056)(7.69)(-15.84) + 2(-.00011)(132.64)(-15.84)\right] = 10.0205$$

$$\sigma_{X_1'} = 3.17$$

We can now say that our estimate of X1, for the new observation with X2 = 18,
X3 = 300, and X4 = 90, is $X_1' = 0.79 \pm 3.17$. Alternatively, applying the method
explained on pages 343 and 344 of Chapter 19, we can say we feel confident that the true
value of X1 for the new observation will lie between -6.37 and 7.95, knowing that
such a statement will be wrong in only one out of twenty such statements, on the
average.

It is evident from this illustration that the standard error of this particular
estimate, 3.17, is larger than the standard error of estimate for the sample, 2.14.
That is because the values of the independent variables for this new observation lay
near the extremes of their several ranges in the sample. It is also evident that the
value of $\sigma_{X_1'}$ will vary with each new observation, depending on the combination
of values for the independent variables in each observation.

Coefficients of partial correlation. Computation of the coefficients of partial
correlation by equation (50) involves the calculation of the multiple correlation of the
dependent variable with successive sets of the independent variables, with a different
independent variable left out in each set. Thus, for the four-variable problem whose
solution is shown in Table 89 (and 92) the three coefficients of partial correlation
involve not only the value $R_{1.234}$, but also $R_{1.23}$, $R_{1.24}$, and $R_{1.34}$. These may be
calculated readily by the same process shown in Table 89. It is not necessary to repeat
the process three times, however, as the several columns may be rearranged with little
additional calculation to omit each independent variable in turn. The first two stages
in this process are illustrated in Table 93. The values for lines I and I' and Σ2 and II'
are copied from Table 89, as shown in the first four lines of Table 93. The columns
X4 and ΣX are dropped, however, as they are not needed at this step.

Lines I' and II' give all the information needed for the "back solution" with X4
omitted. This is accordingly given in the second section of Table 93, using the same
form as in the back solution of Table 89. The verification of the b's by substitution
in equation II, and the calculation of $R^2_{1.23}$ by use of equation I are also shown,
organized the same as in Table 91.

The next step is to enter the values necessary for the "front solution" with X3
omitted. This is shown in the third block of Table 93. Lines I and I' are entered
again, with the column for X3 omitted. Lines III and (0.04142)(I) are copied from
Table 89 for columns X4 and X1. All that is necessary to complete the front solution
is to add the new totals, Σ3, and to divide col. X1 by col. X4, to get the new line
III', and then to proceed with the back solution, as before. The check on equation
III and the calculation of $R^2_{1.24}$ by substitution in equation I are also shown under
the back solution with X3 omitted.





TABLE 93

Doolittle Solution of Normal Equations, to Find Coefficients of Partial
Correlation, for Three Independent Variables

Line                  X2            X3            X1
I                   612.77      2,875.16         30.31
I'                -1.00000      -4.69207      -0.04946
Σ2                             31,388.78      4,782.24
II'                             -1.00000      -0.15236

Back Solution, X4 Omitted

               b12.3        b13.2       Eq. II       Check       Eq. I     Computation
                                                                            of R²1.23
             +0.04964     +0.15236
             -0.71488
b13.2                     +0.15236    44,879.23    6,837.78    4,924.46      750.29
b12.3        -0.66524                  2,875.16   -1,912.67       30.31      -20.16
Sums                                   4,924.46    4,925.11      998.92      730.13

Front Solution, X3 Omitted

Line                  X2            X4            X1
I                   612.77        -25.38         30.31
I'                -1.00000       0.04142      -0.04946
III                             1,093.70        172.85
(0.04142)(I)                       -1.05          1.26
Σ3                              1,092.65        174.11
III'                            -1.00000      -0.15935

Back Solution, X3 Omitted

               b12.4        b14.2       Eq. III      Check       Eq. I     Computation
                                                                            of R²1.24
              0.04946      0.15935
              0.00660
b14.2                      0.15935     1,093.70      174.28      172.85       27.54
b12.4         0.05606                    -25.38       -1.42       30.31        1.70
Sums                                     172.85      172.86      998.92       29.24








The final step in computing the needed coefficients of multiple correlation
involves calculating $R^2_{1.34}$. Since this involves rearranging Table 89 to omit X2, which
appears in the first column, it is necessary to carry through an entire new front
solution, with X2 omitted. This process is shown in Table 94. (The column ΣX is a
new Σ, obtained by adding the values in columns X3, X4, and X1 for lines II and III,
and then using it as a check thereafter.)

In problems where there are four independent variables, this new back solution
should be arranged in this column order: X4, X5, X3, X2, X1. After the entries were
calculated through the front solution, two back solutions could then be run, one
leaving out the X2 column, and one the X3 column. Where there are six independent
variables, a third step could be used by repeating the last two steps of the front
solution for the third independent variable to be dropped out; or a complete new
solution could be run with X3 and X4 occupying the last columns before X1. In
many-variable problems, various other time-saving combinations can be worked out by the
ingenious computer.


TABLE 94

Doolittle Solution of Normal Equations, to Find Coefficients of Partial
Correlation, for Three Independent Variables (continued)

Line                   X3            X4            X1            ΣX
II                 44,879.23      2,720.92      4,924.46     52,524.61
II'                 -1.00000      -0.06063      -0.10973      -1.17046
III                               1,093.70        172.85      3,987.47
(-0.06063)(II)                     -164.97       -298.57     -3,184.45
Σ3                                  928.73       -125.72        803.01
III'                              -1.00000       0.13537      -0.86463

Back Solution, X2 Omitted

               b13.4        b14.3       Eq. III      Check       Eq. I     Computation
                                                                            of R²1.34
              0.10973     -0.13537
              0.00821
b14.3                     -0.13537     2,720.92     -368.33      172.85      -23.30
b13.4         0.11794                 44,879.23    5,293.06    4,924.46      580.70
Sums                                   4,924.46    4,924.73      998.92      557.40

$$R^2_{1.34} = \frac{557.40}{998.92} = 0.558002$$





ALTERNATIVE METHODS OF SOLVING NORMAL EQUATIONS 477 


Tables 90, 93, and 94 provide all the values necessary for the calculation of the
partial correlation coefficients, using equations (47) and (50). These calculations
may be tabled as follows:

                  (1)            (2)            (3)              (4)
Variables     Uncorrected      1 - R²      (n - 1)/(n - m)     1 - R̄²
                  R²                                          = (3) X (2)
1.234           0.8110         0.1890      12/9  = 1.3333       0.2520
1.23            0.7309         0.2691      12/10 = 1.2000       0.3229
1.24            0.0293         0.9707      12/10 = 1.2000      *1.0000
1.34            0.5580         0.4420      12/10 = 1.2000       0.5304

*Taken as 1.0 (largest possible value), when estimate of probable value exceeds 1.0.

$$r^2_{12.34} = 1 - \frac{1 - \bar{R}^2_{1.234}}{1 - \bar{R}^2_{1.34}} = 1 - \frac{0.2520}{0.5304} = 1 - 0.4751 = 0.5249$$

$$r^2_{13.24} = 1 - \frac{1 - \bar{R}^2_{1.234}}{1 - \bar{R}^2_{1.24}} = 1 - \frac{0.2520}{1.0000} = 1 - 0.2520 = 0.7480$$

$$r^2_{14.23} = 1 - \frac{1 - \bar{R}^2_{1.234}}{1 - \bar{R}^2_{1.23}} = 1 - \frac{0.2520}{0.3229} = 1 - 0.7804 = 0.2196$$

$$r_{12.34} = -0.72$$

$$r_{13.24} = 0.87$$

$$r_{14.23} = -0.47$$
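The tabled calculation can be retraced with a short Python sketch. The names `adjusted_unexplained` and `partial_r` are assumptions for this illustration; the uncorrected R² values are those of Tables 90, 93, and 94, and the sign of each partial correlation is taken from the corresponding net regression coefficient.

```python
from math import sqrt

n = 13  # number of observations

def adjusted_unexplained(r_squared, m):
    """1 - adjusted R-squared, limited to its largest possible value of 1.0."""
    return min(1.0, (1 - r_squared) * (n - 1) / (n - m))

# Uncorrected R-squared and number of constants m for each set of variables.
full = adjusted_unexplained(0.8110, 4)        # 1.234
without_x4 = adjusted_unexplained(0.7309, 3)  # 1.23
without_x3 = adjusted_unexplained(0.0293, 3)  # 1.24
without_x2 = adjusted_unexplained(0.5580, 3)  # 1.34

def partial_r(reduced, b):
    """Partial correlation, with the sign of the net regression coefficient b."""
    r_sq = 1 - full / reduced
    return r_sq, (-1 if b < 0 else 1) * sqrt(r_sq)

r2_12, r_12 = partial_r(without_x2, -0.80962)  # X2 dropped from the reduced set
r2_13, r_13 = partial_r(without_x3, 0.18036)   # X3 dropped
r2_14, r_14 = partial_r(without_x4, -0.30944)  # X4 dropped
```

The squared values come out near 0.525, 0.748, and 0.220, so the coefficients agree with those stated above to two decimal places, apart from rounding.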


The signs of the partial correlation coefficients are taken from the signs of the
corresponding net regression coefficients, as shown in Table 89 or 90.

Alternative methods of solving normal equations. The methods for solving
normal equations and obtaining the various constants necessary in correlation
analysis, which have been presented in Tables 89 to 94, inclusive, employ the so-called
Doolittle method of solving equations, first developed by Dr. M. H. Doolittle, a
computer in the Geodetic Survey.^ His method involved slight modifications of the
methods originally suggested by Gauss, the discoverer of the least-squares technique.
(The solutions shown in Tables 93 and 94 involve short cuts added by the author of
this book.) The use of the c22, etc., method of calculating error formulas (the
reciprocal matrix), was also first developed by Gauss, and was revived by R. A.
Fisher. Its further application to calculating the standard errors of an individual
estimate was developed by Dr. Meyer Girshick of the U. S. Department of
Agriculture, at the author's request.


^ M. H. Doolittle, Adjustment of the primary triangulation between Kent
Island and Atlanta base lines (Paper No. 3, Method employed in the solution of
normal equations and the adjustment of a triangulation), Report of the
Superintendent, Coast and Geodetic Survey, 1878, pp. 115-120.



478 


APPENDIX 1 


Since the normal equations are in the form of a symmetrical determinant, the 
methods of determinantal algebra can be applied in their solution. Those familiar 
with determinantal and matrix algebra may find forms of solutions based on these 
principles, developed by Frederick V. Waugh,^ more convenient than the methods 
illustrated here. Careful comparisons of the Waugh solutions with the Doolittle 
solutions by the author of this book, however, have revealed that the Doolittle
method involves somewhat fewer calculations if all that is desired are the constants
whose calculations have been illustrated to this point: the coefficients of net
regression and multiple correlation, the coefficients of partial correlation, and the
standard errors for the regression coefficients and for individual estimates. If one also
wishes to determine all the other possible coefficients of multiple and partial correlation
($R_{2.134}$ as well as $R_{1.234}$, and $r_{23.14}$ as well as $r_{12.34}$, etc.) and all the other net regression
coefficients and equations ($b_{23.14}$ as well as $b_{12.34}$, etc.), the Waugh method is faster.
Since these additional coefficients are rarely used, and since the Doolittle method 
can be understood more readily by students whose mathematical training has not 
extended beyond relatively simple algebra, the presentation here has been restricted 
to the Doolittle method. 

A somewhat simpler short cut in the solution of the normal equations has been 
suggested by P. S. Dwyer.® He points out that much of the “front solution” 
involves subtracting a series of products from, or adding them to, a given figure. 
In Table 91, for example, the item that appears in line Σ4 of column X5 is simply
the value:

100.0000 + (25.6900)(-0.256900) + (39.2091)(-0.414640) + (15.8546)(-0.168107)
= 74.4772

With modern computing machines this value can be computed directly without 
clearing the total dial, using the reverse lever whenever the product is to be sub- 
tracted instead of added. This method saves reading off and entering in the table 
the values that appear in lines IV-1, IV-2, and IV-3. It is slightly more accurate
as it obviates the possible errors in rounding off each of the products as they are
entered in the table. The calculations in the machine, for example, may be carried
to ten decimal places, and only the final sum is rounded off. The method does
involve one handicap, however, in that the multiplier (as for example -0.256900
for line IV-1) has to be set up on the keys separately for each column in turn, whereas
in the usual method it can be set up and left in the machine as a constant multiplier
as all the products clear across line IV-1 are computed. Whether this additional
operation and possibility of error offset the other savings each computer can 
determine for himself. 
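The difference between the two procedures can be seen with the figures quoted above. This is a minimal sketch, assuming that the multiplier for line IV-1 is -0.256900 (the value named in the text for that line) and that products entered in the table are rounded half-up to four decimals; both are assumptions made for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

FOUR_PLACES = Decimal("0.0001")

start = Decimal("100.0000")
pairs = [
    (Decimal("25.6900"), Decimal("-0.256900")),
    (Decimal("39.2091"), Decimal("-0.414640")),
    (Decimal("15.8546"), Decimal("-0.168107")),
]

# Usual tabular method: each product is rounded as it is entered in the table.
tabular = start + sum(
    (x * m).quantize(FOUR_PLACES, rounding=ROUND_HALF_UP) for x, m in pairs
)

# Dwyer short cut: accumulate at full precision and round only the final sum.
direct = (start + sum(x * m for x, m in pairs)).quantize(
    FOUR_PLACES, rounding=ROUND_HALF_UP
)

print(tabular, direct)
```

Under these assumptions the tabular route reproduces the 74.4772 shown in the text, while the direct accumulation differs by one unit in the fourth decimal, which is the rounding effect the paragraph describes.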

Using this Dwyer short cut all the way through, the front solution of Table 89
would show only the lines I and I', Σ2 and II', and Σ3 and III'. Similarly Table 91
would show in the front solution only I and I', Σ2 and II', Σ3 and III', Σ4 and IV',
and Σ5 and V'. Various other possible modifications of the Doolittle solution, all
based on the same basic principle, are shown in Dwyer's article referred to above.


^ Frederick V. Waugh, Journal of the American Statistical Association, December,
1935, and December, 1936.

® P. S. Dwyer, The solution of simultaneous equations, Psychometrika, Vol. 6, 
No. 2, April, 1941. 



AUXILIARY GRAPHIC PROCESSES 


479 


Computing residuals for multiple curvilinear correlations. Where there are a 
large number of individual observations, the average residual around the net regres- 
sion line may be computed from group averages, instead of calculated for each 
individual observation as described in Chapter 14. This may save much time in 
calculating the average residuals to obtain the first approximation regression curves. 

After the net linear regression coefficients are computed, the observations are
thrown into groups with respect to the first independent factor, say X2, and averages
of each factor are computed for the records falling in each group. If there are four
groups, for example, there will be four sets of averages.


Value of X2     Average X2    Average X3    Average X1

0-9             M2-1          M3-1          M1-1
10-19           M2-2          M3-2          M1-2
20-29           M2-3          M3-3          M1-3
30 and over     M2-4          M3-4          M1-4


The average estimated value, $M_{x'}$, may then be calculated for each group by
substituting the means for that group in the regression equation. Thus for the
first group,

$$M_{x'} = a + b_2(M_{2-1}) + b_3(M_{3-1})$$

and

$$M_z = M_{1-1} - M_{x'}$$


In a similar manner the average residual may be calculated from the group
averages for each of the other groups, and then plotted as a departure from the net
regression line, as illustrated in Figure 35 of Chapter 14. After the computation is
completed for X2, the records may be reclassified with respect to X3, new means
calculated for each variable for each group, and the process continued just as for
X2. The same steps are carried out for each other independent variable in turn.
This method may be used to determine the net residuals around curvilinear regressions
fitted by mathematical curves just as well as for linear regressions.

Once the first set of freehand approximation curves has been drawn, the remainder
of the work has to be carried forward just as described in Chapter 14, as the average
of values along a curve does not precisely represent that curve in the same way that
the average of values along a straight line will represent that line.
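That the group-average route gives the same average residual as working record by record, so long as the regression is linear, can be checked directly. The records, class limits, and regression constants in this sketch are hypothetical and serve only as an illustration.

```python
from statistics import mean

# Hypothetical records (X1, X2, X3) and hypothetical net regression constants.
records = [(12.0, 3, 1.0), (15.0, 7, 2.0), (21.0, 12, 1.5),
           (24.0, 18, 3.0), (30.0, 22, 2.5), (29.0, 27, 4.0)]
a, b2, b3 = 8.0, 0.6, 1.2

def estimate(x2, x3):
    """Estimated X1 from the linear net regression equation."""
    return a + b2 * x2 + b3 * x3

def group_of(x2):
    """Class intervals of width 10 on X2; everything from 30 up shares one group."""
    return min(int(x2 // 10), 3)

groups = {}
for x1, x2, x3 in records:
    groups.setdefault(group_of(x2), []).append((x1, x2, x3))

results = {}
for g, rows in sorted(groups.items()):
    m1 = mean(r[0] for r in rows)
    m2 = mean(r[1] for r in rows)
    m3 = mean(r[2] for r in rows)
    from_means = m1 - estimate(m2, m3)        # one computation per group
    one_by_one = mean(x1 - estimate(x2, x3) for x1, x2, x3 in rows)
    results[g] = (from_means, one_by_one)
    print(g, round(from_means, 6), round(one_by_one, 6))
```

The two columns agree because a linear equation evaluated at the group means equals the mean of its values; as the text notes, this stops being exact once freehand curves replace the straight lines.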

Auxiliary graphic processes with the short-cut graphic method. The short-cut
method of determining a net curvilinear regression, described in Chapter 16, may be
materially aided by using graphic methods in transferring departures from one figure
to another, and in calculating the averages of the values as plotted.

After the original observations are plotted and the first approximation to the
regression line or curve is drawn (as in Figure 51 of Chapter 16), the departures from
that line must be plotted against the next variable. A procedure for making these
transfers graphically is shown in Figures 77 to 81. The first step is to place an
arrow in the middle of a strip of blank paper. Using this arrow to indicate the
position of the regression line or curve, we mark off on this paper the vertical
departures of each observation from that line, with each observation indicated by its
number. Figure 77 shows this process just as the first observation (1920) is marked on
the strip. Figure 78 shows it after several such values have been marked on the
strip. The process is continued (with one or more strips of paper) until the vertical
departures have been marked off for each observation.

[Fig. 78. This shows the process of scaling off the departures partially completed.
Axis label: Cost per ton.]

The next step in the process is to transfer these departures to the next figure, 
Figure 52 of Chapter 16. After the chart form has been prepared, the arrow on the 
slip is centered on the zero line, and the departures marked on the figure, with the slip 
moved to the corresponding value as ordinate. Figure 79 shows this step just 
after the value for the 1920 observation was entered on the chart. After the value is 
marked on the new figure, it is crossed off on the slip, to prevent confusion. Figure 80 
shows the process completed, just as the last value on the slip — that for the 1933 
observation — is entered. (It will be noticed that the values are transferred in 
sequence from top to bottom of the slip, to prevent confusion.) 

After the new curve is inserted on the chart, the next step is to transfer residuals
from the new curve to the next figure. The departures can be scaled off from a
curve as readily as from a line. Figure 81 shows the start of the next stage of the
process, after the departures for the observations for 1920 and 1937 have been scaled
off from the first approximation curve on Figure 52, and just as the value for 1936
is entered. The process is completed and carried on to the next chart (Figure 53)
just as illustrated above. The same process is used in transferring the departures for
each stage in the approximation process, always scaling off the residuals from the last
approximation curve, and plotting them as departures from the last curve on the
next chart, prior to drawing in the new curve.

After the departures are entered, averages of departures are sometimes needed.
In such cases, graphic means can be used to average each group of observations.
To do this, an approximate average is inserted by eye. Then all the positive
departures of the observations in that group from the approximate average are
accumulated on one slip, scaled off each in turn as an addition to the other departures, and
all the negative departures from the approximate average are accumulated on
another slip. The difference between the two accumulations is divided by the number
of cases, giving a plus or minus correction to the approximate average. At later
stages when average deviations from a previous line or curve are desired, graphic
accumulations can be used similarly, with the previous line used as the first
approximation to the new average.
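The arithmetic behind the two-slip procedure is simple enough to sketch. The departures and the by-eye starting value below are hypothetical, and `graphic_average` is a name invented for this illustration.

```python
def graphic_average(values, by_eye):
    """Correct an approximate average using accumulated plus and minus departures."""
    positive = sum(v - by_eye for v in values if v > by_eye)   # first slip
    negative = sum(by_eye - v for v in values if v < by_eye)   # second slip
    correction = (positive - negative) / len(values)
    return by_eye + correction

departures = [1.8, -0.4, 2.6, 0.9, -1.1]   # hypothetical plotted departures
print(graphic_average(departures, by_eye=1.0))
print(sum(departures) / len(departures))   # ordinary arithmetic mean, for comparison
```

Whatever value is inserted by eye, the corrected figure is the exact arithmetic mean; a closer first guess only shortens the lengths that have to be accumulated on the slips.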



APPENDIX 2 


TECHNICAL NOTES 


Note 1 (Chapter 2). The formula for estimating the standard error of an average
is derived as follows:

Assume that we have N random samples of n observations each, all expressed
in deviations, x, from the true mean of the variable, X, we are sampling. Designate
the successive observations in each sample x', x'', x''', etc., as shown in the following
tabular statement:


Observation    Sample 1    Sample 2    Sample 3    ...    Sample N

1              x'          x'          x'          ...    x'
2              x''         x''         x''         ...    x''
3              x'''        x'''        x'''        ...    x'''
4              x''''       x''''       x''''       ...    x''''
...            ...         ...         ...         ...    ...
n              x_n         x_n         x_n         ...    x_n

Sum            Σx_1        Σx_2        Σx_3        ...    Σx_N


The first observations in the several samples (line 1) will have a standard
deviation ($\sigma_{x'}$) which will tend to approach the true standard deviation ($\sigma_x$) of the universe
of x's from which the observations are drawn. As N, the number of samples,
becomes larger and larger, $\sigma_{x'}$ will tend to agree more and more closely with $\sigma_x$.

The second observations in the several samples (line 2), will have a standard
deviation ($\sigma_{x''}$) which will also tend to agree more and more closely with the standard
deviation of the universe ($\sigma_x$) as N becomes larger and larger.

Suppose now that the first and second observations in each sample are added.
Let this sum be designated $x_{1+2}$. That is,

$$x' + x'' = x_{1+2}$$

The standard deviation of all the $x_{1+2}$'s from the N samples is, by definition,

$$\sigma_{1+2} = \sqrt{\frac{\Sigma(x_{1+2})^2}{N}}$$

But each

$$(x_{1+2})^2 = (x' + x'')^2 = (x')^2 + 2x'x'' + (x'')^2$$

Hence,

$$\Sigma(x_{1+2})^2 = \Sigma(x')^2 + 2\Sigma x'x'' + \Sigma(x'')^2$$

If the successive observations in each sample, x', x'', x''', etc., are obtained by true
random sampling, they will not be correlated with each other; that is, a large value
for x' will be just as likely to be followed by a small value for x'' as by another large
value.

The correlation of x' with x'' will tend to approach 0 as the number of samples,
N, is increased. But if $r_{x'x''} = 0$, $\Sigma x'x''$ will equal 0. Hence, if the successive
observations x' and x'' are uncorrelated, the last equation becomes

$$\Sigma(x_{1+2})^2 = \Sigma(x')^2 + \Sigma(x'')^2$$

Dividing by N,

$$\sigma^2_{1+2} = \sigma^2_{x'} + \sigma^2_{x''}$$

But $\sigma_{x'}$ and $\sigma_{x''}$ both tend to equal $\sigma_x$. Hence

$$\sigma^2_{1+2} = 2\sigma^2_x, \text{ when } N \text{ is very large.}$$

Similarly, if the first three observations are added, the standard deviation of
their sum, $x_{1+2+3}$, will be given by

$$\sigma^2_{1+2+3} = \sigma^2_{x'} + \sigma^2_{x''} + \sigma^2_{x'''} = 3\sigma^2_x, \text{ when } N \text{ is very large.}$$

So if all n observations in each sample are added, the standard deviation of the
sum, $x_{1+2+\cdots+n}$, will be given by

$$\sigma^2_{1+2+\cdots+n} = \sigma^2_{x'} + \sigma^2_{x''} + \cdots + \sigma^2_{x_n} = n\sigma^2_x, \text{ when } N \text{ is large.}$$

If each observation in each sample and also the sum of all the observations in
each sample are divided through by n (the total number of observations in each
sample), each x' will become x'/n, and each Σx will become Σx/n. The standard
deviation for the series of values x'/n for the first observation in each sample may
then be computed:

$$\text{each } \left(\frac{x'}{n}\right)^2 = \frac{(x')^2}{n^2}$$

and

$$\Sigma\left(\frac{x'}{n}\right)^2 = \frac{\Sigma(x')^2}{n^2}$$

Dividing through by N,

$$\sigma^2_{x'/n} = \frac{\sigma^2_{x'}}{n^2}$$

and since $\sigma_{x'}$ tends to equal $\sigma_x$,

$$\sigma^2_{x'/n} = \frac{\sigma^2_x}{n^2}, \text{ when } N \text{ is very large.}$$

The standard deviation for the second series of observations x''/n, similarly
will be given by

$$\sigma^2_{x''/n} = \frac{\sigma^2_x}{n^2}$$

The standard deviation of the sums of x'/n + x''/n likewise will be given by

$$\sigma^2_{x'/n + x''/n} = \frac{\sigma^2_x}{n^2} + \frac{\sigma^2_x}{n^2} = \frac{2\sigma^2_x}{n^2}$$

So the standard deviation of the sums of all the n values x'/n to x_n/n will be given by

$$\sigma^2_{x'/n + \cdots + x_n/n} = \frac{\sigma^2_x}{n^2} + \frac{\sigma^2_x}{n^2} + \cdots + \frac{\sigma^2_x}{n^2} = \frac{n\sigma^2_x}{n^2} = \frac{\sigma^2_x}{n}$$

But the sums $\Sigma(x/n)$ of all the values x'/n to x_n/n, that is the values $\Sigma x/n$, are the
arithmetic averages for each sample, $M_x$. Hence, under conditions of random
sampling, the standard deviation of the arithmetic means of samples of n observations is
given by the last equation, which may be written

$$\sigma^2_M = \frac{\sigma^2_x}{n}, \quad \text{and hence,} \quad \sigma_M = \frac{\sigma_x}{\sqrt{n}}$$

It should be noted that $\sigma_x$ is the standard deviation of all the items in the
universe, not the standard deviation of the items in a random sample. It has been
shown by "Student" (Biometrika, Vol. 6, p. 1, 1908) that the standard deviation,
$\sigma_s$, of the items in a small random sample, calculated not from the true mean but from
the mean of the values in that sample, tends to be less than the standard deviation,
$\sigma_x$, of all the items in the universe from which that sample was drawn. To obtain
an estimate of the true standard deviation which is free from this bias, the standard
deviation calculated from a sample of n observations must be adjusted by the
equation¹

$$\bar{\sigma}^2_x = \frac{n\sigma^2_s}{n - 1}$$

This is the origin of equation (6.1), given in the text.

Hence

$$\sigma_M = \frac{\bar{\sigma}_x}{\sqrt{n}}$$

It should be noted that an essential assumption made in deriving this formula
is that the successive observations x', x'', etc., are not correlated with each other.
In many types of economic problems, such as time series, for example, this assumption
may be incorrect.* The effect of that fact upon the usefulness of error formulas
for time series has already been discussed in Chapter 19, pages 349 to 356.


1 For a more extended discussion of various adjustments to obtain an unbiased
estimate of $\sigma_x$, see W. Edwards Deming and Raymond T. Birge, On the statistical
theory of errors, Reviews of Modern Physics, Vol. 6, pp. 119-161, July, 1934.
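The result of Note 1 can be checked by drawing many random samples and comparing the scatter of their means with the formula. This is a rough simulation sketch, not part of the derivation; the universe (normal, with standard deviation 2), the sample size, the number of samples, and the seed are all arbitrary choices made for illustration.

```python
import random
from statistics import pstdev

random.seed(1941)                 # fixed seed so the sketch is repeatable
sigma_x, n, N = 2.0, 25, 10000    # universe sigma, sample size, number of samples

sample_means = []
for _ in range(N):
    sample = [random.gauss(0.0, sigma_x) for _ in range(n)]
    sample_means.append(sum(sample) / n)

observed = pstdev(sample_means)   # standard deviation of the N sample means
predicted = sigma_x / n ** 0.5    # sigma_M = sigma_x / sqrt(n)

print(round(observed, 3), round(predicted, 3))
```

Because the draws here are independent, the two figures agree closely; with serially correlated observations, as in many time series, the same comparison would not be expected to hold.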

Note 2 (Chapter 5). In fitting a straight line, the requirement is to determine,
from the series of paired observations of X and Y, values for a and b in the equation

$$Y' = a + bX$$

which will make the sum of the squares of the residuals, Y - Y', as small as possible.
The values whose sum is to be minimized are

$$(Y - Y')^2 = (Y - a - bX)^2 = Y^2 + a^2 + b^2X^2 - 2aY - 2bYX + 2abX$$

The sum of these values is therefore

$$\Sigma(Y - Y')^2 = \Sigma Y^2 + na^2 + b^2\Sigma(X^2) - 2a\Sigma(Y) - 2b\Sigma(YX) + 2ab\Sigma(X)$$

To determine the values of a and b which will make $\Sigma(Y - Y')^2$ a minimum,
the partial derivatives with respect to a and b must be obtained and set equal to zero.

$$\frac{\partial\,\Sigma(Y - Y')^2}{\partial a} = 2na - 2\Sigma(Y) + 2b\Sigma(X)$$

$$\frac{\partial\,\Sigma(Y - Y')^2}{\partial b} = 2b\Sigma(X^2) - 2\Sigma(YX) + 2a\Sigma(X)$$

Setting each equal to zero,

$$2na - 2\Sigma(Y) + 2b\Sigma(X) = 0$$

$$na + \Sigma(X)b = \Sigma(Y) \qquad \text{(I)}$$

and

$$2b\Sigma(X^2) - 2\Sigma(YX) + 2a\Sigma(X) = 0$$

$$\Sigma(X)a + \Sigma(X^2)b = \Sigma(XY) \qquad \text{(II)}$$

Equations (I) and (II) are then the required normal equations, to be solved
simultaneously, as given in footnote 2 of Chapter 5.

Solving the normal equations for a and b, the steps are as follows:

$$na + \Sigma(X)b = \Sigma(Y) \qquad \text{(I)}$$

$$a + \frac{\Sigma(X)}{n}b = \frac{\Sigma(Y)}{n}$$

$$a = \frac{\Sigma(Y)}{n} - \frac{\Sigma(X)}{n}b$$

$$a = M_y - bM_x \qquad \text{(I')}$$

and

$$\Sigma(X)a + \Sigma(X^2)b = \Sigma(XY) \qquad \text{(II)}$$

substituting the value of a given above,

$$\Sigma(X)M_y - \Sigma(X)M_xb + \Sigma(X^2)b = \Sigma(XY)$$

$$nM_xM_y - nM_x^2b + \Sigma(X^2)b = \Sigma(XY)$$

$$\Sigma(X^2)b - nM_x^2b = \Sigma(XY) - nM_xM_y$$

hence

$$b = \frac{\Sigma(XY) - nM_xM_y}{\Sigma(X^2) - nM_x^2} \qquad \text{(II')}$$

Equations (I') and (II') are then the equations given in the text as equations (10)
and (9), to compute b and a.


* This development follows that suggested in G. U. Yule, An Introduction to the
Theory of Statistics, Chapter XVII, §10, pp. 344-345, and Chapter XI, §12, pp. 210-211,
in the sixth edition, C. Griffin & Co., Ltd., London, 1922.
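As a check on the algebra, equations (I') and (II') can be applied to a small set of figures and the results substituted back into the normal equations (I) and (II). The five paired observations below are hypothetical.

```python
X = [1.0, 2.0, 4.0, 5.0, 8.0]     # hypothetical observations
Y = [3.0, 4.0, 6.0, 9.0, 11.0]
n = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_xx = sum(x * x for x in X)
sum_xy = sum(x * y for x, y in zip(X, Y))
m_x, m_y = sum_x / n, sum_y / n

b = (sum_xy - n * m_x * m_y) / (sum_xx - n * m_x ** 2)   # equation (II')
a = m_y - b * m_x                                        # equation (I')

# Both normal equations should be satisfied by these values.
residual_I = n * a + sum_x * b - sum_y                   # equation (I)
residual_II = sum_x * a + sum_xx * b - sum_xy            # equation (II)

print(a, b, residual_I, residual_II)
```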

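Note 3 (Chapter 6). In computing $\Sigma x^2$ for a variable X, or $\Sigma xy$ for two variables
X and Y, the values are not at all affected if some constant value is subtracted from
each item of the series before the computations are made.

Let c represent the constant subtracted, and $D_1$ represent X - c.

Then

$$D_1^2 = (X - c)^2 = X^2 - 2Xc + c^2$$

$$\Sigma D_1^2 = \Sigma X^2 - 2c\Sigma X + nc^2$$

and

$$M_{D_1} = M_x - c$$

Now from equation (5)

$$\Sigma x^2 = \Sigma X^2 - nM_x^2$$

Let

$$d = D_1 - M_{D_1}$$

Then

$$\Sigma d^2 = \Sigma D_1^2 - nM_{D_1}^2$$
$$= \Sigma X^2 - 2c\Sigma X + nc^2 - n(M_x - c)^2$$
$$= \Sigma X^2 - 2ncM_x + nc^2 - nM_x^2 + 2ncM_x - nc^2$$
$$= \Sigma X^2 - nM_x^2$$

Hence

$$\Sigma d^2 = \Sigma x^2$$

By an exactly similar process, if $D_1 = X_1 - c_1$ and $D_2 = X_2 - c_2$, it can be
proved that

$$\Sigma(d_1d_2) = \Sigma(x_1x_2)$$

Similarly, if any variable, X, is multiplied or divided by a constant, c, the effect
will be to multiply or divide $\sigma_x$ by c and $b_{xy}$ by c. Hence, where X is divided by
c, $\sigma_x$ and $b_{xy}$ will be divided by c, but $b_{yx}$ will be multiplied by c.

For

$$\Sigma\frac{X}{c} = (\Sigma X)\frac{1}{c}$$

and

$$M_{x/c} = \frac{M_x}{c}$$

And

$$\Sigma\left(\frac{X}{c}\right)^2 = \frac{\Sigma(X^2)}{c^2}$$

Hence

$$\Sigma\left(\frac{x}{c}\right)^2 = \Sigma\left(\frac{X}{c}\right)^2 - nM_{x/c}^2 = \frac{\Sigma(X^2)}{c^2} - \frac{n(M_x^2)}{c^2} = \frac{\Sigma X^2 - nM_x^2}{c^2} = \frac{\Sigma x^2}{c^2}$$

So

$$\sigma_{x/c} = \sqrt{\frac{\Sigma x^2}{nc^2}} = \frac{\sigma_x}{c}$$

Likewise,

$$\Sigma\left(\frac{X}{c}\right)Y = \frac{\Sigma(XY)}{c}$$

Hence

$$\Sigma\left(\frac{x}{c}\right)y = \Sigma\left(\frac{X}{c}\right)Y - nM_{x/c}M_y = \frac{\Sigma(XY)}{c} - \frac{nM_xM_y}{c} = \frac{\Sigma XY - nM_xM_y}{c} = \frac{\Sigma xy}{c}$$

Similarly,

$$b_{(x/c)y} = \frac{\Sigma xy / c}{\Sigma y^2} = \frac{b_{xy}}{c}$$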
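Both properties in Note 3 are easy to confirm numerically. The figures below are hypothetical, with c = 100 chosen as the kind of round constant one might subtract to shorten the arithmetic.

```python
def dev_sums(X, Y):
    """Return (sum of x^2, sum of xy), both in deviations from the means."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    sxx = sum((x - mx) ** 2 for x in X)
    sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    return sxx, sxy

X = [102.0, 105.0, 109.0, 110.0, 114.0]   # hypothetical observations
Y = [7.0, 9.0, 8.0, 12.0, 14.0]
c = 100.0

sxx, sxy = dev_sums(X, Y)
sxx_shift, sxy_shift = dev_sums([x - c for x in X], Y)   # constant subtracted
sxx_scale, sxy_scale = dev_sums([x / c for x in X], Y)   # divided by a constant

print(sxx, sxx_shift, sxx_scale * c ** 2)
print(sxy, sxy_shift, sxy_scale * c)
```

Subtracting c leaves both sums unchanged, while dividing by c divides the sum of squares by c squared and the sum of products by c, as derived above.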


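Note 3a (Chapter 7). To prove that $r_{yx}$, as computed by equation (23.1),
equals $r_{yx}$, as computed by equation (27).

In equation (23.1)

$$r_{yx} = \frac{\sigma_{y'}}{\sigma_y}$$

$\sigma_{y'}$ is the $\sigma$ of the series of values of Y' estimated from the equation

$$Y' = a + b_{yx}X$$

Each

$$y' = Y' - M_{y'} = b_{yx}(X - M_x) = b_{yx}x \qquad \text{(I)}$$

and

$$b_{yx} = \frac{\Sigma(XY) - nM_xM_y}{\Sigma(X^2) - n(M_x^2)} \qquad \text{(equation [9])}$$

or in terms of departures from the mean

$$b_{yx} = \frac{\Sigma xy}{\Sigma x^2}$$

Substituting this value in equation (I) above, we find that

$$\text{each } y' = \left[\frac{\Sigma xy}{\Sigma x^2}\right]x$$

$$\text{each } (y')^2 = \left[\frac{\Sigma xy}{\Sigma x^2}\right]^2x^2$$

$$\Sigma(y')^2 = \frac{(\Sigma xy)^2\,\Sigma x^2}{(\Sigma x^2)^2} = \frac{(\Sigma xy)^2}{\Sigma x^2}$$

and

$$\sigma^2_{y'} = \frac{(\Sigma xy)^2}{n\Sigma x^2} \qquad \text{(II)}$$

By equation (23.1)

$$r^2_{yx} = \frac{\sigma^2_{y'}}{\sigma^2_y} \qquad \text{(III)}$$

Substituting in equation (III) the value of $\sigma^2_{y'}$ given in equation (II), we have

$$r^2_{yx} = \frac{(\Sigma xy)^2}{n\Sigma x^2\,\sigma^2_y}$$

and, since $\Sigma x^2 = n\sigma^2_x$,

$$r^2_{yx} = \frac{(\Sigma xy)^2}{n^2\sigma^2_x\sigma^2_y}$$

hence

$$r_{yx} = \frac{\Sigma xy}{n\sigma_x\sigma_y} = \frac{\Sigma(XY) - nM_xM_y}{\sqrt{[\Sigma(X^2) - nM_x^2][\Sigma(Y^2) - nM_y^2]}}$$

But equation (27), stated in terms of departures from the means, becomes

$$r_{yx} = \frac{\Sigma xy}{\sqrt{n\sigma^2_x\,n\sigma^2_y}}$$

Hence equations (23.1) and (27) are identical.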
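The identity can also be confirmed on figures. The sketch below computes the correlation once as the ratio of the standard deviation of the estimated values to that of the actual values, and once from the sums of squares and products; the data are hypothetical and positively related, since the ratio of standard deviations carries no sign.

```python
from math import sqrt

X = [2.0, 4.0, 5.0, 7.0, 9.0, 12.0]     # hypothetical observations
Y = [5.0, 4.0, 8.0, 9.0, 9.0, 13.0]
n = len(X)
m_x, m_y = sum(X) / n, sum(Y) / n
x = [v - m_x for v in X]                # departures from the means
y = [v - m_y for v in Y]

sxx = sum(v * v for v in x)
syy = sum(v * v for v in y)
sxy = sum(p * q for p, q in zip(x, y))

b_yx = sxy / sxx                        # regression of Y on X
y_est = [b_yx * v for v in x]           # estimated values, as departures

sigma_y = sqrt(syy / n)
sigma_est = sqrt(sum(v * v for v in y_est) / n)

r_from_sigmas = sigma_est / sigma_y     # ratio form, as in equation (23.1)
r_from_sums = sxy / sqrt(sxx * syy)     # product form, in departures from the means

print(r_from_sigmas, r_from_sums)
```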

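Note 4 (Chapter 7). To prove for a very simple case that $r^2_{yx}$ measures the
proportion of variance in Y explained by X. Let a, b, c, etc., be series of variables
with $\sigma_a = \sigma_b = \sigma_c$, and with all intercorrelations such as $r_{ab}$, $r_{ac}$, etc., = 0.

Let

$$Y = a + b + c$$
$$X = a + b$$

Then

$$r^2_{yx} = \frac{p^2_{yx}}{\sigma^2_x\sigma^2_y}$$

Here the symbol $p_{yx}$ is used to represent $\Sigma(xy)/n$.

$$\text{Each } (y)(x) = (a + b + c)(a + b) = a^2 + 2ab + ac + b^2 + bc$$

Since

$$\Sigma(ab),\ \Sigma(ac),\ \Sigma(bc) = 0$$

$$\Sigma(y)(x) = \Sigma a^2 + \Sigma b^2, \quad \text{and} \quad p_{yx} = \sigma^2_a + \sigma^2_b$$

Similarly,

$$(y^2) = (a + b + c)^2 = a^2 + 2ab + 2ac + b^2 + 2bc + c^2$$

$$\Sigma y^2 = \Sigma a^2 + \Sigma b^2 + \Sigma c^2, \quad \text{and} \quad \sigma^2_y = \sigma^2_a + \sigma^2_b + \sigma^2_c$$

By similar proof,

$$\sigma^2_x = \sigma^2_a + \sigma^2_b$$

Hence

$$r^2_{yx} = \frac{(\sigma^2_a + \sigma^2_b)^2}{(\sigma^2_a + \sigma^2_b)(\sigma^2_a + \sigma^2_b + \sigma^2_c)} = \frac{\sigma^2_a + \sigma^2_b}{\sigma^2_a + \sigma^2_b + \sigma^2_c} = \tfrac{2}{3} \quad (\text{since } \sigma_a = \sigma_b = \sigma_c)$$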

Similar results will be obtained for other simple combinations of elements. 
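A small constructed case shows the same result in figures. The three series below are hypothetical: they were chosen to have zero means, equal standard deviations, and exactly zero intercorrelations, so that the conditions of the note are met exactly rather than approximately.

```python
# Three hypothetical series with zero means, equal variances, and zero intercorrelations.
a = [1, -1, 1, -1, 1, -1, 1, -1]
b = [1, 1, -1, -1, 1, 1, -1, -1]
c = [1, 1, 1, 1, -1, -1, -1, -1]

y = [p + q + r for p, q, r in zip(a, b, c)]   # Y = a + b + c
x = [p + q for p, q in zip(a, b)]             # X = a + b

sxy = sum(p * q for p, q in zip(x, y))
sxx = sum(p * p for p in x)
syy = sum(p * p for p in y)

r_squared = sxy ** 2 / (sxx * syy)
print(r_squared)   # two of the three equal elements of Y are present in X
```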

Note 5 (Chapter 7). Instead of measuring the presence of correlation by comparing
the standard deviation of the estimated values with the standard deviation
of the actual, the amount of absence of correlation may be measured by comparing
the standard deviation of the residuals (the standard error of estimate) with the
original standard deviation. Thus in the horse-feed problem, where the coefficient
of correlation, $r_{yx}$, was found (before adjusting) to be equal to 3.47/7.92 or +0.44,
for the straight-line relation, the standard error of the residuals was equal to 7.13. If
we express this in proportion to the original standard deviation, it gives us the ratio
$\sigma_z/\sigma_y$, or, in this case, 7.13/7.92, which equals 0.90. This term has been given the
name coefficient of alienation,* since it measures the lack of correlation in exactly the
same way that the coefficient of correlation measures the presence of correlation.

In this particular case, the (unadjusted) coefficient of correlation is 0.44, and
the coefficient of alienation is 0.90. The total of the two is considerably greater
than 1.00. This, therefore, warns us that we cannot regard the coefficient of correlation
as giving the percentage of correlation, or the coefficient of alienation as giving
the percentage of absence of correlation. Except when one of the two values is equal
to 0, the sum of the two will always be greater than unity.

If we look back to the values from which these coefficients were computed (the
standard deviation of the original dependent series, the standard deviation of the
estimated values, and the standard deviation of the residuals), it is easy to see why
this is so. The standard deviation of the original values, $\sigma_y$, = 7.92; the standard
deviation of the estimated values, $\sigma_{y'}$, = 3.47; and the standard deviation of the
residuals, $\sigma_z$, = 7.13. When we add the last two together, we find they equal more
than the original standard deviation. Therefore, when we express each of them as a
percentage of this original standard deviation, the sum of the two values is more than
unity. But if we square these standard deviations, we find that $\sigma^2_y$ = 62.73,
$\sigma^2_{y'}$ = 12.04, and $\sigma^2_z$ = 50.84. Here $\sigma^2_{y'}$ plus $\sigma^2_z$, 12.04 plus 50.84, is equal to 62.88,
practically identical with $\sigma^2_y$, 62.73.† This will always hold true, as each individual
Y' + z = Y. It has already been proved (Note 1), that when Y' and z are not
correlated, and Y = Y' + z, that then

$$\sigma^2_{y'} + \sigma^2_z = \sigma^2_y$$

which we have just observed to hold true in this case.

If we measure the amount of correlation by dividing $\sigma^2_{y'}$ by $\sigma^2_y$, and the lack of
correlation by dividing $\sigma^2_z$ by $\sigma^2_y$, we shall have two measures whose sum will always equal
unity, so that when we know what one is, we can tell the other immediately. These
values, $\sigma^2_{y'}/\sigma^2_y$ and $\sigma^2_z/\sigma^2_y$, are known respectively as the coefficient of determination
and the coefficient of non-determination. They equal the square of the coefficient of
correlation ($r^2$), and the square of the coefficient of alienation ($k^2$). In this case
these values, $(0.44)^2$ and $(0.90)^2$, are 0.19 and 0.81, showing that (if the adjustment
for the number of cases is ignored) 19 per cent of the variance in feed is associated
with days worked, and 81 per cent is not so associated.

The adjustment of the coefficient of non-determination for the effect of small
samples, to obtain an unbiased estimate of the most probable value of $k^2$ in the universe
from which the sample is drawn, is

$$\bar{k}^2 = k^2\left(\frac{n - 1}{n - 2}\right) \qquad (102)$$

Applying this adjustment, $\bar{k}^2$ for the horse-feed problem becomes 0.86, indicating
that in the universe, it is likely that 86 per cent of the variance is independent of the
days worked.


* Truman L. Kelley, Statistical Method, pp. 173-175, The Macmillan Co., New
York, 1924.

† The slight difference is due to rounding off decimals in entering the Y' values.

Since the several measures of correlation are all derived from the standard
deviations of Y, Y', and z, certain mathematical relations always hold among the
unadjusted coefficients, as is shown following:

$$r^2_{xy} + k^2_{xy} = 1 \quad (k = \text{the coefficient of alienation})$$

or

$$d_{xy} + k^2_{xy} = 1$$

also,

$$k_{xy} = \sqrt{1 - r^2_{xy}}$$

hence

$$\frac{\sigma^2_z}{\sigma^2_y} = 1 - r^2_{xy}$$

or

$$\sigma^2_z = (1 - r^2_{xy})\,\sigma^2_y$$

$$\sigma_z = \sigma_y\sqrt{1 - r^2_{xy}}$$

This last equation is useful in calculating $S_{yx}$, the standard error in estimating Y
from known values of X, when only the standard deviation of Y and the correlation
of X with Y are known. As is shown in Chapter 8, the coefficient of correlation
can be computed directly without first computing all the estimated values of Y
(the Y' values) or without computing the individual residuals. When the correlation
coefficient is thus computed, this last equation provides a short cut to tell what
the errors in estimating values of Y from the known values of X according to the
straight-line relation are likely to be.

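Similarly, with curvilinear relations,

$$\rho^2_{yx} = \frac{\sigma^2_{y''}}{\sigma^2_y}$$

and

$$\sigma^2_{y''} + \sigma^2_{z''} = \sigma^2_y$$

Hence

$$\rho^2_{yx} + \frac{\sigma^2_{z''}}{\sigma^2_y} = 1$$

and

$$\sigma_{z''} = \sigma_y\sqrt{1 - \rho^2_{yx}}$$

These relations hold precisely true only when $\rho$ is calculated for a mathematically
determined regression curve. For freehand curves, they are only approximately
correct.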
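The horse-feed figures quoted in Note 5 can be run through these relations directly. This sketch uses only the three standard deviations given in the text; small disagreements in the last place reflect the rounding mentioned in the footnote to the note.

```python
from math import sqrt

# Standard deviations quoted in the text for the horse-feed problem.
sigma_y, sigma_est, sigma_z = 7.92, 3.47, 7.13

r = sigma_est / sigma_y           # coefficient of correlation (unadjusted)
k = sigma_z / sigma_y             # coefficient of alienation
determination = r ** 2            # coefficient of determination
non_determination = k ** 2        # coefficient of non-determination

print(round(r, 2), round(k, 2))                     # r + k exceeds 1
print(round(determination, 2), round(non_determination, 2))
print(round(determination + non_determination, 3))  # close to 1

# Short cut: standard error of estimate from sigma_y and r alone.
print(round(sigma_y * sqrt(1 - r ** 2), 2))         # close to the 7.13 found directly
```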





Note 6 (Chapter 12). The normal equations, to determine the "best" regression
values for two or more independent variables, are derived by exactly the same
process as given in full in Note 2. Thus to determine the constants in the equation

$$X_1 = a + b_2X_2 + b_3X_3 + b_4X_4$$

the value to be made a minimum is

$$\Sigma(X_1 - a - b_2X_2 - b_3X_3 - b_4X_4)^2$$

Differentiating this with respect to a, $b_2$, $b_3$, and $b_4$, setting the partial derivatives
equal to zero, and transposing give the normal equations, stated in terms of sums of
$X_1$, $X_2$, etc. The equation may also be stated in terms of deviations from the means

$$x_1 = b_2x_2 + b_3x_3 + b_4x_4$$

This, by the same derivation, gives the normal equations as given in the text
(equation [38]). The constant, a, is then found separately by equation (39), which
represents simply the separate solution of the first normal equation given by the
first derivation,

$$\Sigma X_1 = na + \Sigma(X_2)b_2 + \Sigma(X_3)b_3 + \Sigma(X_4)b_4$$

Hence

$$a = \frac{\Sigma X_1}{n} - \frac{\Sigma(X_2)}{n}b_2 - \frac{\Sigma(X_3)}{n}b_3 - \frac{\Sigma(X_4)}{n}b_4$$

This readily reduces to equation (39).
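The two-stage procedure (solve the normal equations in deviations from the means, then recover a from the means) can be sketched as follows. The data are hypothetical and were built from a known equation so that the answer can be checked, and the small elimination routine is a generic solver written for this illustration, not the Doolittle layout of Appendix 1.

```python
def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def mean(v):
    return sum(v) / len(v)

def dev(v):
    m = mean(v)
    return [t - m for t in v]

# Hypothetical data generated from X1 = 2 + 1.5 X2 - 0.5 X3 + 0.25 X4.
X2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X3 = [2.0, 1.0, 4.0, 3.0, 6.0, 8.0]
X4 = [5.0, 3.0, 8.0, 1.0, 9.0, 2.0]
X1 = [2.0 + 1.5 * p - 0.5 * q + 0.25 * r for p, q, r in zip(X2, X3, X4)]

x1, xs = dev(X1), [dev(X2), dev(X3), dev(X4)]

# Normal equations in deviations: sums of squares and products of x2, x3, x4.
normal = [[sum(p * q for p, q in zip(u, w)) for w in xs] for u in xs]
rhs = [sum(p * q for p, q in zip(u, x1)) for u in xs]

b2, b3, b4 = solve(normal, rhs)
a = mean(X1) - b2 * mean(X2) - b3 * mean(X3) - b4 * mean(X4)
print(a, b2, b3, b4)
```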

Note 7 (Chapter 13). Coefficients of partial correlation are usually defined by
the formula

$$r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{(1 - r^2_{13})(1 - r^2_{23})}}$$

For coefficients with more variables eliminated, such as $r_{12.345}$, for example, this
becomes

$$r_{12.345} = \frac{r_{12.34} - r_{15.34}\,r_{25.34}}{\sqrt{(1 - r^2_{15.34})(1 - r^2_{25.34})}}$$

To determine the coefficients with several factors held constant by this method
involves a lengthy process of elimination, variable by variable; and for that reason
the method presented in the text is preferred as shorter, simpler, and more readily
subject to checking.
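The variable-by-variable elimination can be written as a short recursion, which also makes plain why it grows lengthy by hand: each added variable requires three coefficients of the next lower order. The function names and the table of simple correlations (every pair set to 0.5) are hypothetical.

```python
from math import sqrt

def eliminate_one(r12, r13, r23):
    """Partial correlation of 1 and 2 after eliminating one further variable."""
    return (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

def partial(r, i, j, held):
    """r_ij.held, eliminating the held variables one at a time (last one first)."""
    if not held:
        return r[frozenset((i, j))]
    k, rest = held[-1], held[:-1]
    return eliminate_one(
        partial(r, i, j, rest),
        partial(r, i, k, rest),
        partial(r, j, k, rest),
    )

# Hypothetical simple correlations among four variables, all equal to 0.5.
r = {frozenset((p, q)): 0.5 for p in range(1, 5) for q in range(p + 1, 5)}

print(partial(r, 1, 2, (3,)))      # r_12.3
print(partial(r, 1, 2, (3, 4)))    # r_12.34
```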

Note 8 (Chapter 13). It can be proved that the coefficient of multiple
determination ($R^2$) measures the percentage of variance ascribable to the several
independent factors for certain simple cases. Thus assume four variables, A, B, C, D,
with all intercorrelations equal to 0, and all $\sigma$ equal. Let Y = A + B + C. Then
correlate Y with A, B, and D. The regression equation will work out

$$Y = a + A + B\ [+\ 0(D)]$$

Computing $R_{y.abd}$ by equation (46),

$$R^2_{y.abd} = \frac{b_{ya.bd}\,\Sigma ya + b_{yb.ad}\,\Sigma yb + b_{yd.ab}\,\Sigma yd}{\Sigma y^2}$$

Each

$$(y)(a) = (a + b + c)(a) = a^2 + ab + ac$$

$$\Sigma(ya) = \Sigma a^2 + \Sigma ab + \Sigma ac = \Sigma a^2 \quad (\text{since } r_{ab} = r_{ac} = 0)$$

Similarly

$$\Sigma(yb) = \Sigma b^2; \qquad \Sigma(yd) = \Sigma ad + \Sigma bd + \Sigma cd = 0$$

And each

$$(y^2) = (a + b + c)^2 = a^2 + 2ab + b^2 + 2ac + 2bc + c^2$$

and

$$\Sigma(y^2) = \Sigma a^2 + \Sigma b^2 + \Sigma c^2$$

Hence

$$R^2_{y.abd} = \frac{(1)(\Sigma a^2) + (1)(\Sigma b^2) + (0)(\Sigma yd)}{\Sigma a^2 + \Sigma b^2 + \Sigma c^2}$$

And since all $\sigma$'s are identical,

$$R^2_{y.abd} = \tfrac{2}{3}$$

In this case, then, when Y, composed of three equally variable non-correlated
elements, is correlated with two of those elements, and with one other equal element
which is not represented in Y and which is not correlated with elements present in
Y, the multiple determination of Y by the two elements (A and B) is found to be 2/3.

Similar results will be secured for other experimental cases which may be set up. 
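One such experimental case can be set up in a few lines. The four series below are hypothetical, constructed with zero means, equal variances, and exactly zero intercorrelations; because the independent variables are uncorrelated, each net regression coefficient reduces to the simple one, which is the shortcut the sketch relies on.

```python
# Four hypothetical series: zero means, equal variances, zero intercorrelations.
a = [1, -1, 1, -1, 1, -1, 1, -1]
b = [1, 1, -1, -1, 1, 1, -1, -1]
c = [1, 1, 1, 1, -1, -1, -1, -1]
d = [1, -1, -1, 1, 1, -1, -1, 1]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

y = [p + q + r for p, q, r in zip(a, b, c)]   # Y = A + B + C; D is absent from Y

# With uncorrelated independent variables, each net regression coefficient
# equals the simple coefficient, sum(y x) / sum(x^2).
independents = {"a": a, "b": b, "d": d}
coef = {name: dot(y, v) / dot(v, v) for name, v in independents.items()}

# Multiple determination as sum of b * sum(y x), divided by sum(y^2).
R_squared = sum(coef[name] * dot(y, v) for name, v in independents.items()) / dot(y, y)

print(coef, R_squared)
```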

Note 9 (Chapter 13). The coefficient of part correlation was first worked out 
by B. B. Smith, in collaboration with the present author, and was first published in 
Correlation theory and method applied to agricultural research, pp. 67-60, Bureau 
of Agricultural Economics, U. S. Department of Agriculture, August, 1926, (a mimeo- 
graphed publication). 

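The formula for part correlation coefficients is derived as follows:

Let ${}_{12}r_{34}$ equal the correlation of $x_2$ with $x_1 - b_{13.24}x_3 - b_{14.23}x_4$.

Let

$$z = x_1 - b_2x_2 - b_3x_3 - b_4x_4$$

Then

$$z + b_2x_2 = x_1 - b_3x_3 - b_4x_4$$

So

$$({}_{12}r_{34})^2 = \frac{[\Sigma(z + b_2x_2)(x_2)]^2}{[\Sigma(z + b_2x_2)^2][\Sigma x_2^2]}$$

$$\text{each } (z + b_2x_2)(x_2) = zx_2 + b_2x_2^2$$

$$\Sigma(z + b_2x_2)(x_2) = b_2\Sigma x_2^2 \quad (r_{zx_2} = 0, \text{ hence } \Sigma zx_2 = 0)$$

$$\text{each } (z + b_2x_2)^2 = z^2 + 2b_2x_2z + b_2^2x_2^2$$

$$\Sigma(z + b_2x_2)^2 = \Sigma z^2 + b_2^2\Sigma x_2^2$$

$$({}_{12}r_{34})^2 = \frac{b_2^2(\Sigma x_2^2)^2}{(\Sigma z^2 + b_2^2\Sigma x_2^2)(\Sigma x_2^2)} = \frac{b_2^2\Sigma x_2^2}{\Sigma z^2 + b_2^2\Sigma x_2^2}$$

$$= \frac{1}{1 + \dfrac{n\sigma_1^2(1 - R^2_{1.234})}{b_2^2\,n\sigma_2^2}} = \frac{1}{1 + \dfrac{\sigma_1^2(1 - R^2_{1.234})}{b_2^2\sigma_2^2}}$$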



Note 10 (Chapter 13). The coefficient of partial correlation, as used in biometric
work, is employed to measure the correlation in an hypothetical universe, from
which all variation due to changes in the eliminated factors has been excluded. This 
correlation may then be compared to the simple correlation of the two factors, to see 
whether the closeness of relation is improved or not by excluding the variation in 
both variables associated with other factors. With the coefficient of part correla- 
tion, on the contrary, all the original variation in the independent factor is left in 
it, and only the dependent factor is adjusted. 

Note 11 (Chapter 13). Separate determination. Neither the coefficients of 
partial determination nor the coefficients of part determination equal, when totaled, 
the multiple determination of Xi by X 2 , X 3 , and X 4 . That is b cause both these 
types of measures are computed on bases which change from variable to variable; 
hence their sums, when several are added together, have no mathematical significance. 
There is, however, a third type of coefficient, which parcels out among each of the 
several independent variables that part of the variation in the dependent variable 
which each one of them seems able to account for, when estimates of the dependent 
variable are made from all of the independent variables. To distinguish this type 
from the coefficients previously presented, they may be termed coefficients of separate 
determination. 

Using di 2.34 to represent the separate determination of Xi by X 2 , when Xz and X 4 
are also considered, we may compute it by the formula 


612. .34 (2x1X2) 

El. 234 

S(x!) J 

_Ei.234_ 


Each of these values has been used previously in computing the coefficient of 
multiple correlation, so the determination coefficient may be readily calculated. 

When the several coefficients of separate determination are added together, their 
sum is equal to the coefficient of multiple determination, R^. Comparing the last 
equation with equation (46) for we readily see that 


El. 234 = 


bi2M(^XiX2) hiz,2^(XxiXz) 614.23(20:1X4) El. 234 

_ 2 (x?) X(xl) 2 (xb JLei.234. 


The three terms on the right are the three coefficients of separate determination, 
di 2 . 34 , di 3 . 24 , and ^ 14 . 23 . These coefficients are the simplest to compute of any of the 
three types which have been discussed, the computation of their values being readily 
made a part of the process of working out E and S. 

Working out the last equation for acres and income, in the 4-variable problem,
we find that it becomes

$$ d_{12.34} = \left[\frac{(1.20584)(0.71)}{(272.76)}\right]\left[\frac{0.806}{0.8366}\right] = \frac{0.690}{228.19} = 0.00302 $$


The corresponding values are 0.630 for the separate determination of income by 
cows and 0.171 for the separate determination of income by number of men. 
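The arithmetic of this worked example can be retraced in a few lines. The sketch below is only an illustration: it takes the two factors of the numerator in the order printed (treating 1.20584 as the net regression coefficient and 0.71 as the product sum), takes 0.806 and 0.8366 as the adjusted and unadjusted multiple determinations, and uses hypothetical variable and function names that do not appear in the text.

```python
# Figures as printed in the worked example of this note (4-variable farm problem).
b_acres = 1.20584        # net regression of income on acres, as printed
sum_x1x2 = 0.71          # product sum of income and acres, as printed
sum_x1_sq = 272.76       # sum of squared departures of income
r2_adjusted = 0.806      # adjusted multiple determination (assumed reading)
r2_unadjusted = 0.8366   # unadjusted multiple determination (assumed reading)


def separate_determination(b, sum_x1xk, sum_x1_squared, r2_adj, r2_unadj):
    """Separate determination, scaled so that the set sums to the adjusted R^2."""
    return (b * sum_x1xk / sum_x1_squared) * (r2_adj / r2_unadj)


d_acres = separate_determination(
    b_acres, sum_x1x2, sum_x1_sq, r2_adjusted, r2_unadjusted
)
d_cows, d_men = 0.630, 0.171   # values quoted in the text for the other two factors
total = d_acres + d_cows + d_men

print(f"separate determination by acres = {d_acres:.5f}")
print(f"sum of the three coefficients   = {total:.3f}")
```

The sum comes out within rounding error of the adjusted multiple determination, which is the additive property the note describes.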

The coefficients of "separate" determination are the easiest of all to compute,
and have the further advantage of adding to a definite sum ($\bar{R}^2$), and hence being
directly comparable one with another. The disadvantage in their use, however, is
that under certain conditions the value of one or more coefficients will prove to
be negative. Offhand it seems difficult to explain how the "determination" of
any variable can be less than nothing. (This result will be obtained whenever the
gross or apparent correlation and the coefficient of net or partial regression are of
opposite sign.) The explanation is simple, however. Although the total variation
in the estimates of the dependent variable is obtained by adding the contributions
from the several independent variables, it does not follow that all variables will be
influencing the estimate in the same direction at the same time, all tending to give
low values when the actual value is low, or all tending to give high values when
the actual value is high. It sometimes happens that one variable may tend to
work counter to the other variables, usually preventing the final estimate from
going so low as it otherwise would when the general effect is downward, and tending
to keep it from going so high as it otherwise would when the others are forcing it up.
It is under such conditions that negative coefficients of separate determination are
obtained; they do not mean that the variable has no significance, but that its influence
is usually exerted counter to the influence of other variables.
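A small numerical case may make the negative coefficient concrete. The sketch below is an illustration only: it uses two independent variables rather than three, unadjusted values throughout, and hypothetical correlations chosen so that the gross correlation of the second factor is positive while its net (beta) regression is negative. It relies on the relation, shown later in this note, that a separate determination equals the beta coefficient times the gross correlation.

```python
def betas_two_predictors(r12, r13, r23):
    """Beta coefficients for X1 on X2 and X3, from the three simple correlations."""
    denom = 1.0 - r23 ** 2
    beta2 = (r12 - r13 * r23) / denom
    beta3 = (r13 - r12 * r23) / denom
    return beta2, beta3


# Hypothetical correlations (not from the text): X3 is weakly and positively
# correlated with X1, but highly intercorrelated with X2.
r12, r13, r23 = 0.6, 0.2, 0.8

beta2, beta3 = betas_two_predictors(r12, r13, r23)
d12 = beta2 * r12          # separate determination by X2
d13 = beta3 * r13          # separate determination by X3 (negative here)

# Multiple determination computed independently from the correlations.
r_squared = (r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1.0 - r23 ** 2)

print(f"beta2 = {beta2:.4f}, beta3 = {beta3:.4f}")
print(f"d12 = {d12:.4f}, d13 = {d13:.4f}, sum = {d12 + d13:.4f}")
print(f"R squared = {r_squared:.4f}")
```

The negative coefficient still enters the sum, so the total agrees with the multiple determination even though one share is "less than nothing."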

When there is very high intercorrelation between the several independent variables,
the coefficients of separate determination may vary quite erratically, and hence
become of little significance. Under such conditions other measures of the individual
importance of the several factors will need to be employed.

Although nothing is known of the sampling error involved in determining coefficients
of separate determination, since they are computed from standard deviations,
product sums, and net regression coefficients, their standard error must be some
function of the standard error of these other coefficients. It is under the conditions
noted in the preceding paragraph that net regression coefficients have the least 
reliability, so it may be that a problem which fails to yield reasonable separate 
determination coefficients may also fail to yield reliable values for the other measures 
of determination. 

There seems to be evidence that coefficients of separate determination are less 
stable, and more subject to random error, than any other measure of the importance 
of individual factors. On account of their ease of computation, they have been much 
used in the past; but it is doubtful how much confidence can be placed in them. For 
that reason it seems best to use the other measures, and discard this measure until 
its reliability has been more definitely determined. 

The relation between beta coefficients and coefficients of separate determination
may be shown algebraically.

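The normal equations for determining the regression coefficients

(I)

$$
\begin{aligned}
\Sigma x_2^2\,b_2 + \Sigma x_2x_3\,b_3 + \Sigma x_2x_4\,b_4 &= \Sigma x_1x_2 \\
\Sigma x_2x_3\,b_2 + \Sigma x_3^2\,b_3 + \Sigma x_3x_4\,b_4 &= \Sigma x_1x_3 \\
\Sigma x_2x_4\,b_2 + \Sigma x_3x_4\,b_3 + \Sigma x_4^2\,b_4 &= \Sigma x_1x_4
\end{aligned}
$$

may also be written after dividing through by n, the number of cases, and dividing
each line and column by the corresponding standard deviation. Solution of these
equations, which are shown below, then gives the values for the partial (net) beta
coefficients.

(II)

$$
\begin{aligned}
\beta_2 + r_{23}\beta_3 + r_{24}\beta_4 &= r_{12} \\
r_{23}\beta_2 + \beta_3 + r_{34}\beta_4 &= r_{13} \\
r_{24}\beta_2 + r_{34}\beta_3 + \beta_4 &= r_{14}
\end{aligned}
$$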




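For the first set of equations (I) the separate determination of $X_1$ by $X_2$, $d_{12.34}$,
is given by the equation

$$ d_{12.34} = \frac{b_2\,\Sigma x_1x_2}{\Sigma x_1^2} \qquad \left(\text{i.e., } = \frac{b_{12.34}\,\Sigma x_1x_2}{\Sigma x_1^2}\right) $$

In terms of the values given in the second set of equations (II), the coefficient
of separate determination would be computed

$$ d_{12.34} = \beta_2 r_{12} \qquad (\text{i.e., } = \beta_{12.34}\,r_{12}) $$

Substituting the value for $r_{12}$ given by the first equation of the second set, we
see that this becomes

$$
\begin{aligned}
d_{12.34} &= \beta_2(\beta_2 + r_{23}\beta_3 + r_{24}\beta_4) \\
          &= \beta_2^2 + r_{23}\beta_2\beta_3 + r_{24}\beta_2\beta_4
\end{aligned}
$$

Similarly,

$$ d_{13.24} = \beta_3^2 + r_{23}\beta_2\beta_3 + r_{34}\beta_3\beta_4 $$

and

$$ d_{14.23} = \beta_4^2 + r_{24}\beta_2\beta_4 + r_{34}\beta_3\beta_4 $$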


It is evident from this that each coefficient of separate determination consists of
one portion, $\beta^2$, which is, as Dr. Sewall Wright named it, the "direct determination"
by that independent variable; plus (or minus) a pro-rated share of the joint determination
of that variable with each other independent variable. Since $r_{23}\beta_2\beta_3$ contributes
equally to both $d_{12.34}$ and $d_{13.24}$, this "joint determination" is simply divided
equally between both independent variables. As only the direct determination
($\beta^2$) can be said to reflect the separate influence of the particular independent variable,
the further attempt to allocate or split up the joint influence is unsatisfactory.
For that reason, the several betas (or their squares) seem the best measures of the
separable importance of each variable, the combined influence of variables acting
jointly being left out of the distribution.
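The split into a direct part and pro-rated joint parts can be checked numerically. The following is a minimal sketch under stated assumptions: the six correlations are hypothetical, the small `solve` helper is an ordinary Gauss-Jordan elimination written for this illustration (it is not the Doolittle layout used elsewhere in the book), and all values are unadjusted.

```python
def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Hypothetical correlations, chosen only for illustration.
r12, r13, r14 = 0.5, 0.4, 0.3      # dependent variable with each independent
r23, r24, r34 = 0.3, 0.2, 0.1      # intercorrelations of the independents

# Equations (II): solve for the three beta coefficients.
beta2, beta3, beta4 = solve(
    [[1.0, r23, r24], [r23, 1.0, r34], [r24, r34, 1.0]],
    [r12, r13, r14],
)

# Separate determinations, computed as beta times gross correlation.
d12, d13, d14 = beta2 * r12, beta3 * r13, beta4 * r14

# The same quantities rebuilt as direct determination plus joint shares.
split12 = beta2 ** 2 + r23 * beta2 * beta3 + r24 * beta2 * beta4
split13 = beta3 ** 2 + r23 * beta2 * beta3 + r34 * beta3 * beta4
split14 = beta4 ** 2 + r24 * beta2 * beta4 + r34 * beta3 * beta4

for name, d, split, beta in (("X2", d12, split12, beta2),
                             ("X3", d13, split13, beta3),
                             ("X4", d14, split14, beta4)):
    print(f"{name}: separate = {d:.4f}, direct (beta squared) = {beta ** 2:.4f}, "
          f"joint share = {split - beta ** 2:.4f}")
```

Each joint term such as $r_{23}\beta_2\beta_3$ appears once in each of the two coefficients it links, which is the equal division the text criticizes.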

This explanation follows that developed by H. R. Tolley, and presented in full in
the bulletin by F. F. Elliott, Adjusting hog production to market demand, University
of Illinois Agricultural Experiment Station Bulletin 293, 1927. See also Sewall Wright,
Correlation and Causation, Journal of Agricultural Research, Vol. XX, No. 7, pp. 557-575.

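Note 12 (Chapter 13). Given the multiple regression equation

$$ X_1 = a + b_2X_2 + b_3X_3 + b_4X_4 $$

let $R_{(X_1 - b_2X_2).34}$ represent the correlation between $[X_1 - b_2X_2]$, and $X_3$ and $X_4$.

To find the formula for this correlation:

Using departures from their means for all variables,

$$ z = x_1 - b_2x_2 - b_3x_3 - b_4x_4 $$

hence

$$ b_3x_3 + b_4x_4 = x_1 - b_2x_2 - z $$

The required correlation is therefore that between

$$ [x_1 - b_2x_2] \quad \text{and} \quad [x_1 - b_2x_2 - z] $$

From equation (27) it is equal to

$$ \frac{\Sigma[(x_1 - b_2x_2)(x_1 - b_2x_2 - z)]}{\sqrt{\Sigma(x_1 - b_2x_2)^2\;\Sigma(x_1 - b_2x_2 - z)^2}} $$

Each

$$ (x_1 - b_2x_2)(x_1 - b_2x_2 - z) = (x_1 - b_2x_2)^2 - zx_1 + b_2zx_2 $$

and

$$ \Sigma[(x_1 - b_2x_2)(x_1 - b_2x_2 - z)] = \Sigma(x_1 - b_2x_2)^2 - \Sigma zx_1 + b_2\Sigma zx_2 $$

Since $z$ is uncorrelated with $x_2$, $\Sigma zx_2 = 0$.

The value of $\Sigma zx_1$ may be evaluated as follows:

$$
\begin{aligned}
x_1 &= b_2x_2 + b_3x_3 + b_4x_4 + z \\
zx_1 &= b_2zx_2 + b_3zx_3 + b_4zx_4 + z^2 \\
\Sigma zx_1 &= b_2\Sigma zx_2 + b_3\Sigma zx_3 + b_4\Sigma zx_4 + \Sigma z^2
\end{aligned}
$$

But

$$ \Sigma zx_2 = \Sigma zx_3 = \Sigma zx_4 = 0 $$

hence

$$ \Sigma zx_1 = \Sigma z^2 $$

Therefore,

$$ \Sigma[(x_1 - b_2x_2)(x_1 - b_2x_2 - z)] = \Sigma(x_1 - b_2x_2)^2 - \Sigma z^2 $$

Similarly,

$$ \Sigma(x_1 - b_2x_2 - z)^2 $$

may be shown to equal

$$ \Sigma(x_1 - b_2x_2)^2 - \Sigma z^2 $$

Hence

$$ R_{(X_1 - b_2X_2).34} = \frac{\Sigma(x_1 - b_2x_2)^2 - \Sigma z^2}{\sqrt{\Sigma(x_1 - b_2x_2)^2}\,\sqrt{\Sigma(x_1 - b_2x_2)^2 - \Sigma z^2}} $$

And

$$
\begin{aligned}
R^2_{(X_1 - b_2X_2).34} &= \frac{[\Sigma(x_1 - b_2x_2)^2 - \Sigma z^2]^2}{\Sigma(x_1 - b_2x_2)^2\,[\Sigma(x_1 - b_2x_2)^2 - \Sigma z^2]} \\
&= \frac{\Sigma(x_1 - b_2x_2)^2 - \Sigma z^2}{\Sigma(x_1 - b_2x_2)^2} \\
&= 1 - \frac{\Sigma z^2}{\Sigma(x_1 - b_2x_2)^2} \\
&= 1 - \frac{\sigma_z^2}{\sigma_1^2 - 2b_2p_{12} + b_2^2\sigma_2^2} \\
&= 1 - \frac{\sigma_1^2(1 - R^2_{1.234})}{\sigma_1^2 - 2b_2r_{12}\sigma_1\sigma_2 + b_2^2\sigma_2^2}
\end{aligned}
$$

where $p_{12} = r_{12}\sigma_1\sigma_2$.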
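The result of this note can be checked on made-up data. The sketch below is an illustration under stated assumptions: the data are random numbers generated with a fixed seed, the `solve` helper is a plain Gauss-Jordan elimination written for the example, and all names are hypothetical. It compares the squared correlation obtained by actually regressing $x_1 - b_2x_2$ on $x_3$ and $x_4$ with the closing formula of the note.

```python
import random


def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def centered(values):
    """Express a variable as departures from its mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(u, v):
    """Product sum of two series of departures."""
    return sum(p * q for p, q in zip(u, v))


random.seed(1)
n = 40
raw2 = [random.gauss(0, 1) for _ in range(n)]
raw3 = [0.5 * v + random.gauss(0, 1) for v in raw2]
raw4 = [random.gauss(0, 1) for _ in range(n)]
raw1 = [2.0 * p - q + 0.5 * s + random.gauss(0, 1)
        for p, q, s in zip(raw2, raw3, raw4)]
x1, x2, x3, x4 = (centered(v) for v in (raw1, raw2, raw3, raw4))

# Full regression of x1 on x2, x3, x4 and its residuals z.
xs = [x2, x3, x4]
b2, b3, b4 = solve([[dot(p, q) for q in xs] for p in xs],
                   [dot(x1, p) for p in xs])
z = [v1 - b2 * v2 - b3 * v3 - b4 * v4 for v1, v2, v3, v4 in zip(x1, x2, x3, x4)]
r2_multiple = 1.0 - dot(z, z) / dot(x1, x1)

# Direct route: regress w = x1 - b2*x2 on x3 and x4.
w = [v1 - b2 * v2 for v1, v2 in zip(x1, x2)]
g3, g4 = solve([[dot(x3, x3), dot(x3, x4)], [dot(x3, x4), dot(x4, x4)]],
               [dot(w, x3), dot(w, x4)])
resid = [vw - g3 * v3 - g4 * v4 for vw, v3, v4 in zip(w, x3, x4)]
r2_direct = 1.0 - dot(resid, resid) / dot(w, w)

# Formula route, using variances and the product moment p12.
var1, var2, p12 = dot(x1, x1) / n, dot(x2, x2) / n, dot(x1, x2) / n
r2_formula = 1.0 - var1 * (1.0 - r2_multiple) / (var1 - 2.0 * b2 * p12 + b2 ** 2 * var2)

print(f"direct: {r2_direct:.6f}   formula: {r2_formula:.6f}")
```

The two routes agree to rounding error because regressing the adjusted dependent on $x_3$ and $x_4$ reproduces the original $b_3$ and $b_4$, leaving the same residuals $z$.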





Note 13 (for Appendix on Methods of Computation). To prove that the equation

$$ \sigma_{b_{13.24}} = \bar{S}_{1.234}\sqrt{c_{33}} \qquad (101) $$

gives the same value as that given by equation (74)

$$ \sigma_{b_{13.24}} = \frac{\bar{S}_{1.234}}{\sqrt{n\sigma_3^2(1 - R^2_{3.24})}} \qquad (74) $$

For a problem in two independent variables, $c_{33}$ is obtained by the simultaneous
solution of the equations

$$
\begin{aligned}
\Sigma(x_2^2)c_{32} + \Sigma(x_2x_3)c_{33} &= 0 \\
\Sigma(x_2x_3)c_{32} + \Sigma(x_3^2)c_{33} &= 1
\end{aligned}
$$

Solving by the Doolittle method, we have

$$
\begin{aligned}
\Sigma(x_2^2)c_{32} + \Sigma(x_2x_3)c_{33} &= 0 \\
-c_{32} - \frac{\Sigma(x_2x_3)}{\Sigma(x_2^2)}c_{33} &= 0 \\
\Sigma(x_2x_3)c_{32} + \Sigma(x_3^2)c_{33} &= 1 \\
-\Sigma(x_2x_3)c_{32} - \frac{[\Sigma(x_2x_3)]^2}{\Sigma(x_2^2)}c_{33} &= 0
\end{aligned}
$$

and

$$
\begin{aligned}
c_{33}\left\{\Sigma(x_3^2) - \frac{[\Sigma(x_2x_3)]^2}{\Sigma(x_2^2)}\right\} &= 1 \\
\frac{1}{c_{33}} = \Sigma(x_3^2)\left\{1 - \frac{[\Sigma(x_2x_3)]^2}{\Sigma(x_2^2)\,\Sigma(x_3^2)}\right\} &= n\sigma_3^2(1 - r_{23}^2)
\end{aligned}
$$

Hence

$$ c_{33} = \frac{1}{n\sigma_3^2(1 - r_{23}^2)} $$

Substituting this value for $c_{33}$ in equation (101), we obtain

$$ \sigma_{b_{13.2}} = \bar{S}_{1.23}\sqrt{\frac{1}{n\sigma_3^2(1 - r_{23}^2)}} = \sqrt{\frac{\bar{S}^2_{1.23}}{n\sigma_3^2(1 - r_{23}^2)}} $$

This is seen to be identical with the value given by equation (74), when written
for the corresponding coefficient. The equations to determine $c_{22}$ for a problem of
three independent variables are

$$
\begin{aligned}
\Sigma(x_2^2)c_{22} + \Sigma(x_2x_3)c_{23} + \Sigma(x_2x_4)c_{24} &= 1 \\
\Sigma(x_2x_3)c_{22} + \Sigma(x_3^2)c_{23} + \Sigma(x_3x_4)c_{24} &= 0 \\
\Sigma(x_2x_4)c_{22} + \Sigma(x_3x_4)c_{23} + \Sigma(x_4^2)c_{24} &= 0
\end{aligned}
$$

If these equations are solved simultaneously, it will be found that the value for
$c_{22}$ will be

$$ c_{22} = \frac{1}{n\sigma_2^2(1 - R^2_{2.34})} $$

and substituting this in equation (101) will again show that result to be identical with
equation (74). This same proof may be carried through for any number of independent
variables.
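The three-variable statement, which the note leaves to the reader, can be verified numerically. The sketch below assumes a hypothetical set of product sums (any positive-definite set would serve) and uses an ordinary Gauss-Jordan `solve` helper written for the illustration rather than the Doolittle layout.

```python
def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Hypothetical product sums of departures for X2, X3, X4 (illustration only).
s22, s23, s24 = 10.0, 4.0, 2.0
s33, s34 = 8.0, 3.0
s44 = 6.0

# The three equations for c22, c23, c24, with right-hand sides 1, 0, 0.
c22, c23, c24 = solve(
    [[s22, s23, s24], [s23, s33, s34], [s24, s34, s44]],
    [1.0, 0.0, 0.0],
)

# Multiple determination of X2 by X3 and X4, from the same product sums.
g3, g4 = solve([[s33, s34], [s34, s44]], [s23, s24])
r2_2_34 = (g3 * s23 + g4 * s24) / s22

# n * sigma_2 squared is simply the sum of squared departures of X2.
c22_from_formula = 1.0 / (s22 * (1.0 - r2_2_34))

print(f"c22 from the equations: {c22:.6f}")
print(f"c22 from the formula:   {c22_from_formula:.6f}")
```

Multiplying the standard error of estimate by the square root of either value therefore gives the same standard error for the regression coefficient.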

Note 14 (Chapter 19). The standard error of an individual forecast is composed
of the error of points along the calculated regression line plus that of individual
estimates around that line. The standard error of the former is given by equation (70.1),
page 310,

$$ \sigma_{y'} = \sqrt{\sigma^2_{M_y} + (\sigma_{b_{yx}}x)^2} \qquad (70.1) $$

while that of the latter is the standard error of estimate, $\bar{S}_{y.x}$ (equation 28).

Assuming that individual errors of calculated points along the line are uncorrelated
with departures of individual forecasts from the line, the square of the standard
error of an individual forecast is the sum of the squares of the standard errors of the
two components, as follows:

$$
\begin{aligned}
\sigma^2_{y'} &= \sigma^2_{M_y} + (\sigma_{b_{yx}}x)^2 \\
\sigma^2_{\text{individual estimates}} &= \bar{S}^2_{y.x}
\end{aligned}
$$

hence

$$ \sigma^2_{Y'-Y} = \sigma^2_{y'} + \bar{S}^2_{y.x} \qquad (75.1) $$

or

$$ \sigma^2_{Y'-Y} = \sigma^2_{M_y} + (\sigma_{b_{yx}}x)^2 + \bar{S}^2_{y.x} \qquad (75) $$
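The combination of the two error components may be sketched as follows. The function name and the numerical inputs are hypothetical, chosen only to show how the three squared terms of equation (75) are added; they are not taken from any problem in the text.

```python
import math


def forecast_standard_error(se_mean, se_slope, x_departure, se_estimate):
    """Return (error of the line at x, error of an individual forecast at x).

    se_mean      -- standard error of the mean of the dependent variable
    se_slope     -- standard error of the regression coefficient
    x_departure  -- departure of the independent variable from its mean
    se_estimate  -- standard error of estimate around the line
    """
    # Equation (70.1): error of the calculated point on the regression line.
    se_line = math.sqrt(se_mean ** 2 + (se_slope * x_departure) ** 2)
    # Equation (75.1): add the scatter of individual cases around the line.
    se_forecast = math.sqrt(se_line ** 2 + se_estimate ** 2)
    return se_line, se_forecast


# Hypothetical inputs, for illustration only.
se_line, se_forecast = forecast_standard_error(
    se_mean=0.5, se_slope=0.2, x_departure=3.0, se_estimate=2.0
)
print(f"standard error of the line at x:      {se_line:.4f}")
print(f"standard error of an individual case: {se_forecast:.4f}")
```

Because the slope term grows with the departure of x from its mean, the forecast error widens for cases far from the center of the data, while the standard error of estimate sets a floor that no sample size removes.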



APPENDIX 3 


CHARTS FOR INTERPRETING OR ADJUSTING 
CORRELATION CONSTANTS 

Reliability of small samples. The accompanying figures will serve to facilitate
and simplify many of the computations which are discussed in the text.

Figure A is an extension of Table A, in Chapter 2. For random samples of vary- 
ing sizes, it gives the average proportion of samples of each given size in which the 
observed mean will miss the true mean in the universe by more than the stated 
multiple of the standard error of the mean, as computed from each sample. The 
figure is drawn for 2, 3, 4, 5, 6, 7, 8, 10, 12, 16, 20, 30, and ∞ observations, and
may be used for any desired multiple of the standard error from 1.0 to 6.0. By 
interpolation, values for intermediate sizes of samples may be read. The figure is 
read by entering with the desired multiple of the standard error (shown at the 
bottom) and noting the ordinate where the line for the given number of observations 
intersects that abscissa. The ordinate then gives the average proportion of samples 
in which such a departure will occur solely by chance. The figure may also be 
entered with a desired probability and the given number of observations, to deter- 
mine what multiple of the standard error must be taken to give that degree of relia- 
bility. Thus if with 10 observations a reliability of 0.95 was desired, the figure 
indicates 2.26 times the standard error. That is, with 10 observations, in 5 samples 
out of 100, on the average, the true mean will not come within the range covered 
by observed mean ±2.26 S.E., if the sample was drawn under the conditions assumed 
in random sampling. 

Just as with Table A, Figure A may be used to judge the reliability of certain other 
coefficients, by subtracting 1 from the number of observations for each additional 
degree of freedom removed in determining the constant. For coefficients of simple 
regression, 1 must be subtracted; for partial or multiple regression coefficients, sub- 
tract the number of independent variables from n; for curvilinear regressions, 
subtract (m — 1). 

Figure A is based upon the results given by "Student" in his article, New tables
for testing the significance of observations, Metron, Vol. V, No. 3, pp. 105-120, 1925.
It is comparable to Fisher's t table, with n here equal to Fisher's n', or his n + 1.

[Fig. A. The proportion of random samples in which the observed mean will miss
the true mean by more than the stated multiple of the standard error computed
from the sample, for samples with 2, 3, 4, 5, 6, 7, 8, 10, 12, 16, 20, 30, or ∞
observations. (To apply to coefficients of regression, see footnote to Table A,
page 23.) Chart not reproduced. Ordinate: proportion of samples in which specified
departure will occur by chance. Abscissa: number of times the standard error,
1.0 to 6.0.]

Reliability of observed correlations. Figures B, C, D, and E have been discussed
in Chapter 18, pages 319 to 324. These figures provide a ready means of judging
the probable minimum value for the correlation in the universe, with any observed
value and any given size of sample. The chart is entered with the observed correlation
as abscissa; the ordinate for the intersection of that abscissa with the curve for
the given size of sample gives the probable correlation. Thus if a coefficient of
simple correlation, $r_{xy} = 0.65$, is obtained from a sample of 22 cases, the researcher
will know from Figure B that, if he makes the statement that the true correlation in
the universe is at least 0.38, he will be wrong in only 5 per cent of such statements,
on the average. Figure C applies to $R_{1.234}$, Figure D to $R_{1.23456}$, and Figure E to
$R_{1.2345678}$. Values for 2, 4, and 6 independent variables may be obtained by
interpolation. The figures are based upon the researches of R. A. Fisher, summarized
in his publication, The general sampling distribution of the multiple correlation
coefficient, Proceedings of the Royal Society, A, Vol. 121, pp. 655-673, 1928. The
tables in that article assume a large sample, and therefore give only approximately
correct values when applied to small samples. For that reason the values shown
by Figures B to E do not agree precisely with the exact values given in the
corresponding tables in Fisher's article or in Wishart's tables (cited on page 522) for the
value of observed correlations when the samples are drawn from a universe with zero
correlation. The differences are so slight, however, that Figures B to E are quite
adequate for practical purposes.

[Fig. B. Minimum correlation in universe, for varying observed correlations and
size of sample: simple correlation. Under conditions of random sampling, one sample
out of twenty, on the average, will show a correlation coefficient with a ± value as
high as that "observed in sample," when drawn from a universe with the stated true
correlation. Chart not reproduced. Ordinate: true correlation. Abscissa: correlation
observed in sample, 0 to 1.00.]

[Fig. C. Minimum correlation in universe, for varying observed correlations and
size of sample. Under conditions of random sampling, one sample out of twenty, on
the average, will show a multiple correlation as high as that "observed in sample,"
when drawn from a universe with the stated true multiple correlation, in the
case of multiple correlation with three independent variables. Chart not reproduced.]

[Fig. D. Minimum correlation in universe, for varying observed correlations and
size of sample. Under conditions of random sampling, one sample out of twenty, on the
average, will show a multiple correlation as high as that "observed in sample,"
when drawn from a universe with the stated true multiple correlation, in the
case of multiple correlation with five independent variables. Chart not reproduced.]

[Fig. E. Under conditions of random sampling, one sample out of twenty, on
the average, will show a multiple correlation as high as that "observed in sample,"
when drawn from a universe with the stated true multiple correlation, in the
case of multiple correlation with seven independent variables. Chart not reproduced.]

[Fig. F. This chart provides a graphic means of calculating the adjusted coefficient
or index of correlation, as shown in formulas (25), (26), (47), and (66.3). Chart not
reproduced.]

Adjustment of correlation for size of sample. Figure F may be used to facilitate
the calculation of adjusted coefficients and indexes of correlation, $\bar{r}$, $\bar{R}$, $\bar{\rho}$, or $\bar{P}$, from
the unadjusted values $r$, $R$, $\rho$, or $P$. All that is necessary is to calculate the ratio
$\frac{n - m}{n - 1}$, and enter the chart with that as abscissa and the observed correlation as
ordinate. Then note the value of $\bar{r}$ given by the curve which lies nearest the
intersection of the two coordinates, or interpolate between the two nearest curves. Thus
with $P = 0.70$, $n = 30$, and $m = 8$, $\frac{n - m}{n - 1} = \frac{22}{29} = 0.759$, and $\bar{P} = 0.57$. Likewise for
$R_{1.234} = 0.88$, and $n = 20$, $\frac{n - m}{n - 1} = \frac{16}{19} = 0.842$, and $\bar{R}_{1.234} = 0.85$.
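The chart reading can be compared with a direct calculation. The sketch below assumes the adjustment takes the form of formula (47) in Appendix 4, that is, the adjusted square equals one minus the unadjusted remainder multiplied by (n - 1)/(n - m). The function name is hypothetical, and returning zero when the adjusted square turns negative is an assumption made for the illustration.

```python
import math


def adjusted_correlation(observed, n, m):
    """Adjust a coefficient or index of correlation for the size of sample.

    observed -- unadjusted r, R, rho, or P
    n        -- number of observations
    m        -- number of constants in the regression equation
    """
    adjusted_square = 1.0 - (1.0 - observed ** 2) * (n - 1) / (n - m)
    # Assumption: a negative adjusted square is reported as zero correlation.
    return math.sqrt(max(0.0, adjusted_square))


# The two examples given in the text.
p_bar = adjusted_correlation(0.70, n=30, m=8)
r_bar = adjusted_correlation(0.88, n=20, m=4)

print(f"ratio (n - m)/(n - 1) = {22 / 29:.3f}, adjusted P = {p_bar:.3f}")
print(f"ratio (n - m)/(n - 1) = {16 / 19:.3f}, adjusted R = {r_bar:.3f}")
```

The first result rounds to the 0.57 read from the chart. The second comes to about 0.856, against the chart reading of 0.85 quoted in the text, a difference of the size to be expected from graphic interpolation.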



APPENDIX 4 


LIST OF IMPORTANT EQUATIONS 


For convenience in referring to the most important of the equations 
which are introduced from time to time in the text, all numbered equa- 
tions are repeated here in numerical order. 


M. = 


SJ 


X- M, = x 

'Zx (without regard to sign) 


S = 


(Tx 


^4 


n 


n 


Cx 


Cu = 


- «? 


71 


1 

S(dF) 2 r 

^ n 1 

in 12 


/ n 

^ 7 

^ n — I 




Ox 


- nMl 

n — 1 


( 1 ) 

( 2 ) 

(3) 

(4) 

( 5 ) 

( 0 ) 

( 6 . 1 ) 

( 6 . 2 ) 

( 6 . 6 ) 


OM = — 7= 
V n 


612 


(7.1) 





_ 1 
•N/2(n - 1) 

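(7.2)

$$ Y = a + bX \qquad (8) $$

$$ b = \frac{\Sigma(XY) - nM_xM_y}{\Sigma(X^2) - n(M_x)^2} \qquad (9) $$

$$ a = M_y - bM_x \qquad (10) $$

$$ \Sigma(XY) - nM_xM_y = \Sigma(xy) \qquad (11) $$

$$ Y = a + bX + cX^2 \qquad (12) $$

(With $X$ used for $X$, $x$ for $X - M_x$, $U$ for $X^2$, $u$ for $U - M_u$, equation
[12] becomes $Y = a + bX + cU$. These symbols are used in equations
[13] to [15], inclusive.)

$$
\begin{aligned}
(\Sigma x^2)b + (\Sigma xu)c &= \Sigma xy \\
(\Sigma xu)b + (\Sigma u^2)c &= \Sigma uy
\end{aligned}
\qquad (13)
$$

$$ a = M_y - b(M_x) - c(M_u) \qquad (14) $$

$$
\begin{aligned}
M_x = \frac{\Sigma X}{n}; \quad M_u &= \frac{\Sigma U}{n}; \quad M_y = \frac{\Sigma Y}{n} \\
\Sigma x^2 &= \Sigma X^2 - nM_x^2 \\
\Sigma xu &= \Sigma XU - nM_xM_u \\
\Sigma u^2 &= \Sigma U^2 - nM_u^2 \\
\Sigma xy &= \Sigma XY - nM_xM_y \\
\Sigma uy &= \Sigma UY - nM_uM_y
\end{aligned}
\qquad (15)
$$

$$ Y = a + bX + cX^2 + dX^3 \qquad (16) $$

(With $U$ for $X^2$, $V$ for $X^3$, equation [16] becomes $Y = a + bX + cU + dV$.
These symbols are used in equations [17] to [19], inclusive.)

$$
\begin{aligned}
(\Sigma x^2)b + (\Sigma xu)c + (\Sigma xv)d &= \Sigma xy \\
(\Sigma xu)b + (\Sigma u^2)c + (\Sigma uv)d &= \Sigma uy \\
(\Sigma xv)b + (\Sigma uv)c + (\Sigma v^2)d &= \Sigma vy
\end{aligned}
\qquad (17)
$$

$$ a = M_y - b(M_x) - c(M_u) - d(M_v) \qquad (18) $$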
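The use of equations (12) to (15) in fitting a parabola may be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, the two normal equations are solved by determinants rather than by the Doolittle routine, and the data are made up so that they lie exactly on a known parabola.

```python
def fit_parabola(xs, ys):
    """Fit Y = a + bX + cX^2 by the normal equations, with U standing for X^2."""
    n = len(xs)
    us = [x * x for x in xs]

    # Means of the three series.
    m_x, m_u, m_y = sum(xs) / n, sum(us) / n, sum(ys) / n

    # Product sums in terms of departures from the means.
    s_xx = sum(x * x for x in xs) - n * m_x * m_x
    s_xu = sum(x * u for x, u in zip(xs, us)) - n * m_x * m_u
    s_uu = sum(u * u for u in us) - n * m_u * m_u
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * m_x * m_y
    s_uy = sum(u * y for u, y in zip(us, ys)) - n * m_u * m_y

    # Solve the two normal equations for b and c by determinants.
    det = s_xx * s_uu - s_xu * s_xu
    b = (s_xy * s_uu - s_xu * s_uy) / det
    c = (s_xx * s_uy - s_xu * s_xy) / det

    # The constant term follows from the means.
    a = m_y - b * m_x - c * m_u
    return a, b, c


# Made-up data lying exactly on Y = 1 + 2X + 3X^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]

a, b, c = fit_parabola(xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}, c = {c:.4f}")
```

Treating the square of X as a second variable U is what reduces the curvilinear fit to an ordinary two-variable set of normal equations; the cubic of equation (16) extends the same device with a third variable V.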





Mv = 

Suu = 
"Zxv = 
= 

Ivy = 

q2 _ 
^^y.x “ 


n 

'Z,UV - nMuM^ 
SZy - nMJi^ 
Sy2 -nMl 

syy - nMMy 

2 22^ 
tr! = — 

n 




yJix) 

q2 _ 
^y.x 

q2 _ 

SyJix) ' 
^y.Kx) 


2 ^ 


nSE 


n — 2 


- 2 


= O', 


n 


n-2 '^>-2 

na-^z'r _ 

n — m n — m 




{--) 


vx 


Pj/o: 

dx'u 




w = 1 _ 7.2 

''“ rc'fv -*■ ' a : 


'^xy 


2 

Pj/a: 


• '}ir. 


A' byx^xy 


= 1 - (1 - 4 ) 




-2 

Pya: 




(19) 


( 20 . 1 ) 

( 21 . 1 ) 

( 21 . 2 ) 

( 22 . 1 ) 

( 22 . 2 ) 


(24.1) 

(24.2) 

(24.3) 


(25) 


(26) 





S(jy) - nM^My 


2(X7) - nMj^My 

S(xj/) 

mrl 

II 

S(X7) - nM^My 

_ 2(rry) 

7l(Tx^y 

na^y 

^ /S(y2) - n{My? 

(1 - 4) 

^ ti- 1 


^2 1 
Put. = 1 


n — 1 
n — m 


Xi ~ a + 62-^ 2 + + . . . bnX. 

Xi = a + Z)2A"2 + ?>3A^3 
^0-2)62 + ^(X 2 x^)b, = S(a:ia:2)' 

S(.r 2 .n.)?M + 3 (.ri)f )3 = S( 2 -ia- 3 ), 

(I = Ml — h^M 2 — h-^Ms 


Xi = </ 


(27) 

(27.1) 

(27.2) 

(28) 
(29) 



X[ 

= a + 

i^2 A 2 

+ ?>3A’^3 


(33) 


Z 

= Xi ■ 

- A’l 



(34) 


Xi 

= (ii2: 

\ + b] 

12.3 A 2 H" bi^oX^ 


(35) 


14 2 

+ 

21 A a 

+ 1-1.2 3 A 4 


(36) 

<'i.2:)-ir> + ^'12 

.2 

‘2 + f>l 

3.2-ir>- 

^'3 + bu>2MX^ + hi, 

5.234X5 

(37) 

:^(A)Iu2m 

-1- 

:i(.<'2.<:. 

t)?^13.2 

4 + ^(.r23’4)/>i.i.2a = 

S(.r,X2)j 



44 -1- 


>13.2-1 

+ ^Cr:{.r.|)?>i4.23 = 

^(■'■i.ra) 

(38) 


44 -1- 


\)bl’A.'2 

l-l + ^ 0*4 -1.23 “ 





Ml - by.. 

; 4 ,|A /2 — hyi 

24/1/3 — 

?>M. 23 A/.i 

(39) 

+ 

^(.r 2 .r 3 ) 5 ,:, 

. 24,5 + ^(-'^ 

X.\)h\ .1 .2;j 

IT) 




H- :i(.r2 

.rp,)/>i Pj 2; 

$.1 == :::(:ri.r2) 


-f 

2 (,r5)?),n.2.: 

1.5 + ^(r;, 


tn 

■ (40) 



4- Ur, 

;,r.p,V)| p) 2: 

M == 2(.ri.r3) 






<^1.2346 = ” ^>12.345-^2 ”” ^13.245-^3 ■“ ?>14.235^4 ?>15.234-^5 (^1) 


— 2 ^^21.234 

>^1.234 = 

71 — m 


JS(a:f) -[6i 2 .34 . . . n(Sa:iX2) + ^13.24 . . . n (Sri3-.0l 


q2 

>^1.234 . . . n — 


I 


+ • . ■ + ?>ln.23 . . . (n~l)(S.ri3’n)] 


n — m 

-X'l(234) = <^1.234 + 5x2.34^2 + ?>13.24-X"3 + 5x4.23^4 

0*1(234) 


fix .234 = 


0-x 


(42) 

(43) 

(44) 

(45) 




234 ...n— 


?512.34 . . . n(2^^1^2) + ^13.24 . . . n(^XiXs) + . . . 
+ &ln.23 . . . (n-l)(S.Ti;rn) 


'(^l) 


J?1.234 . . . n = 1 ” (1 -^1.234 . . . n) 




234 ... n 


Si. 


234 . . . n 


(~) 
=‘-r^)(“) 

= erf (1 — El, 234 . . . n) 


^2 1 ^ — ■^1-234 

ri4.23 = I 52 

i — jti:i.23 

r2 2 

012.340*2 


-2 

12^34 


^2. 


^^12.340*2 + oj(l — E 1 . 234 ) 

X ^2 


34 — 0X2.34 ■ 


Ol 


Multiple correlation squared 
of (Xx ■“ 5x2.34X2) with X3 
and X4 


- 1 - 


0?(1 - E?.234) 


O'! — 25x2 . 


34' 


“'‘'1*^2 


(46) 

(47) 

(48) 

(49) 

(50) 

(51) 

(52) 
(5;i) 


- 1 + 5 ^ 2 . 340*2 


Xi = a' +/ 2 (A’’ 2 ) fziXs) -\-fiiX 4 ) + . . . (54) 

Xi = a + 63X2 + ?)2'(X2) + 53X3 + h's'{X:i,) ] 

, (55) 

+ 64 X 4 + 64 ^X 1 ) J 



LIST OP IMPORTANT EQUATIONS 

Xi = a + b2(X2) + 1)2'(X2) + h2'>(Xf) + isiXz) + bz'(X^) 
+ h"iXl) + h(X^) + b4-(X|) + hu(Xl)\ 

Xi = ai.234 ^foiXz) +faiX3) +/4(X4) 

S[ fziXz) + faiXz) + /:(X4)] 

tti .234 — Ml 

n 

z’' = Xi - Xi 

Xi = F2(X2) = /2('^2) "" '^^/(2) + Ml 
Xi = ^2(^2) = f2(X2) — Mf(j2) 

Xi = F 2 fe) + /^ 3 (-Xa) + ^4(^4). • ■+ Fn{Xn) 

^l./(2,3,4) — ^Si./(2,3.4) 

2 

?y2 ^2:i./( 2.3.4, etc.) 

oi,/(2,3,4. etc.) = T~ 

1 — m/n 

2(2") 

Ol./(2,3,4. etc.) = 

n — m n — m 



Vyx 


=4 




m,, 


Sy.x 

^ “" 7" 

(^x'Vn 

= Vn 


a,,’ = \4a'i, , + (iT/,„a-)' 


517 

(56) 

(57) 

(58) 

m 

(60) 

(61) 

(62) 

(63) 

(64) 

(65) 
( 66 . 1 ) 
( 66 . 2 ) 
(66.3) 

(67) 

( 68 ) 

(69) 

(70) 
(70.1) 



APPENDIX 4 


1 — 

- 1 ^ , for large values of n 


t = 


\/n — 2 
r\/n — 2 

V 1 — 


: , for large values of n 


's/n — m 


1 - Rl. 


^Ri*284 . . . H ^ / 

V n — 


,234 ... 71 


, for large values of n 


m 


(71) 
(71.1) 

(72) 

(73) 


0^6i 2.84 . . . n \ 


>Sl.234 . . . n 


ncr2(l — ^2.34 . . . n) 


Ri. 234^ n — m 

Vi - /ei234 


J 




<rlnu 



^ >!5’i./(2,3.4)'^^^2 


■" Pi. 34 ) 


<^/(X)-aXm) = where fc = 

Thu 



, where /c' = 
Thu 

^^l./(2,3,4)'W 


0 - 2(1 — P 2 . 34 ) 


(74) 

(74.05) 

(74.1) 

(74.2) 
(74.11) 
(74.21) 


(Tr-r = a- My' + + ^.x 

Y = Y'±t, 7 y'.y 


( 75 ) 

(76) 


‘’ii-234— “ '^'l.234|^l "I" ^ ^222^2 + ^33^:3 + Ci^x\ 

+ 2023X2X3 + 2C24X2X4 + 2034X32:4 

(2x1)022 H” (2X2X3)023 4“ (2X3X4)024 = 1 

(Sx 2.T3)C22 “I" (2X3)023 4“ (21X3X4)024 = 0 - 

(SX2.X4)022 4" (2X3X4)033 4“ (2X4)024 = 0 


(77) 

(78) 



LIST OF IMPORTANT EQUATIONS 

(2x1)032 + ( 2 X 2 X 3)033 + ( 2 X 2 X 4)034 = O' 

(2X2X3)032 + (2X3)033 + (2X3X4)034 = 1 ■ 

( 2 x 2 X 4)032 + ( 2 x 3 X 4)033 + ( 2 x 4)034 = 0 


519 

(79) 


( 2 x 2)042 + ( 2 x 2 X 3)043 + ( 2 x 2 X 4)044 = 0 

( 2 x 2 X 3)042 + ( 2 x 3)043 + ( 2 x 3 X 4)044 = 0 

( 2 x 2 X 4)042 + ( 2 x 3 X 4)043 + ( 2 x 4)044 = 1 

1 H h (O2X2 + C3X3 + . . . +OnXn)^ 

In J 

on condition that (C 2 O 2 ) = C 22 , C 20 n = 02 n, etc. 

(^ + ^) + [standard error of /(X) - KXm)? (82) 


2 — 

^j i.23 . . . n—^1 ^1-23 . 


(80) 


(81) 




‘S'?./( 2 ;m) (^1 + -J + <r%(Xi) + o'hiXt) + 

A’l = /(A’2, Xa) 

X'l = / 2 , 3 (X 2 , X 3 ) +/ 4 (X 4 ) 

Xi = /2,3(A'2) X’^ 3 ) +/4,r)('X4, A'd) +/ 6 (X 6 ) 

Xl =/(X2,A'3,A'4,...X„) 

z'"' = (l■z.2Si + ^>52.34X’^2 + b^;f,24X;i &z4.23X4 

A”^! = ai.3'3'4' + ^> 12 ' .3' 4' ifi (A' 2 )] + f>l3'.2'4' [/:! (X 3 )] ] 

+ ^>14'.2'3' [/l (A' 4 )]J 

x'r = dixT) 


(83) 

(84) 

(85) 

( 86 ) 

(87) 

( 88 ) 


XT' = e[« +/2"(X2) +y^;'(X3) +/';'(X4)] 

Aj = a + ("A 3 + (/(AoAs) 

Ai = ct + f'As + ^(AaAa) d- /(.(Aa) 

Al = «H"/2(A2) + /2,3(A2X'^3) d-^^Aa) 

Xl =/ 2 (X 2 ) +h{X.,) d-/2+3(X2 + X 3 ) 

Xl = f2(X2) d'/3(Ad)) + /2+3^“ d- (06 


(89) 

(90) 

(91) 

(92) 

(93) 

(94) 

(95) 





^1 = 12 (^ 2 ) + +f4:i^i) 


+f2-s^ ^ ^ V (: 

\ 0-2 0'3 ^^4 / \ 


>Si./(23 


0-2 



<rz 

0-4 / 


Xs 


<T2 

<^3 

0*4 / 

Pi. 23 


n 


n 


. n; V 

\n — 

J 

(dF)! 

2 


n _ 




^fcl2.34 ” Si.234\/^ 
^*> 13.24 ~ >§ 1 . 234 ^/^ 
^*>14.23 ” >Si.234\/ C44 


n — m 

^12.34(^x13:2) 1 r /?! .234 I 

. S(a^) J IrUsJ 



(97) 

(98) 

(99) 

( 100 ) 

( 101 ) 

( 102 ) 

( 103 ) 



APPENDIX 5 


GLOSSARY 

The Greek letters used as symbols in this text, and the most impor- 
tant other symbols, are as follows: 

δ (small delta) = coefficient of average deviation.

σ (small sigma) = coefficient of standard deviation.

Σ (capital sigma) = sum of the items specified.

n (Latin) = number of observations in a sample.

b (Latin) = coefficient of regression.

f( ) (Latin) = function of the variable in the parenthesis.

r (Latin) = coefficient of correlation.

ρ (small rho) = index of (curvilinear) correlation.

S (Latin) = standard error of estimate.

m (Latin) = number of constants in the regression equation.

z (Latin) = residual, or difference between observed and estimated values
of a dependent variable.

R (Latin) = coefficient of multiple correlation.

β (small beta) = beta coefficient of regression, in terms of unit standard
deviations.

P (capital rho) = index of multiple (curvilinear) correlation.

η (small eta) = correlation ratio.

θ (small theta) = function of (used here for the Bruce adjustment function).

Δ (capital delta) = arbitrary symbol.

X (small in) — arbitrary symbol. 

Φ (capital phi) = function of.

X, Y (Latin) = variables, as observed.

x, y (Latin) = variables, in terms of departures from their means.

d (Latin) = coefficient of determination.

k (Latin) = coefficient of alienation.






REFERENCES 

Excellent bibliographies covering the basic development of the theory of statistics
are given in G. U. Yule and M. G. Kendall's Introduction to the Theory of Statistics,
and in the special study, Studies in the History of Statistical Method, by Helen M.
Walker (Williams & Wilkins Co., Baltimore, 1929). In addition, a brief list of
references covering especially articles on the theory of sampling is given in
R. A. Fisher's Statistical Methods for Research Workers. No attempt will be made
here to repeat these bibliographies; instead, the student of statistical theory is
referred to the sources mentioned.

Articles used as the basis for specific points have already been cited at various 
places in this book, particularly in Chapters 22 and 23. In addition to those, the 
methods discussed which go beyond most statistical textbooks are based upon the 
following technical articles: 

Brandt, A. E. Use of machine factoring in multiple correlation, Jour. Amer. Stat.
Assoc., XXIII, p. 291. September, 1928.

Ezekiel, Mordecai. A method of handling curvilinear correlation for any number
of variables, Quart. Pub., Amer. Stat. Assoc., Vol. XIX, pp. 431-453. December,
1924.

Ezekiel, Mordecai. The assumptions implied in the multiple regression equation,
Jour. Amer. Stat. Assoc., Vol. XX, pp. 405-408. September, 1925.

Ezekiel, Mordecai. The determination of curvilinear regression "surfaces" in the
presence of other variables, Jour. Amer. Stat. Assoc., Vol. XXI, pp. 310-320.
September, 1926.

Ezekiel, Mordecai. The application of the theory of error to multiple and
curvilinear correlations, Proceedings Amer. Stat. Assoc., pp. 99-104. March, 1929.

Ezekiel, Mordecai. A first approximation to the sampling reliability of multiple
correlation curves obtained by successive graphic approximations, Annals of
Mathematical Statistics, Vol. I. September, 1930.

Mendenhall, Robert M., and Richard Warren. The Mendenhall-Warren-Hollerith
correlation method, Columbia Univ. Stat. Bur. Doc. 1. 1929.

Mills, Frederick C. The measurement of correlation and the problem of estimation,
Quart. Pub., Amer. Stat. Assoc., pp. 273-300. September, 1924.

Schultz, Henry. The standard error of a forecast from a curve, Jour. Amer. Stat.
Assoc., pp. 139-185. June, 1930.

Smith, Bradford B. Forecasting the acreage of cotton, Jour. Amer. Stat. Assoc.,
pp. 31-47, especially footnotes on pp. 41 and 42. March, 1925.

Smith, Bradford B. The use of punched card tabulating equipment in multiple
correlation problems, Bur. of Agri. Econ., mimeographed report. 1923.

Smith, Bradford B. Correlation theory and method applied to agricultural
research .... Dept. of Agr., Bur. Agr. Econ., mimeographed report. August, 1926.

Tolley, H. R., and Mordecai Ezekiel. A method of handling multiple correlation
problems, Quart. Pub., Amer. Stat. Assoc., pp. 994-1003. Dec., 1923.

Tolley, H. R., and Mordecai Ezekiel. The Doolittle method for solving multiple
correlation equations versus the Kelley-Salisbury "iteration" method, Jour. Amer.
Stat. Assoc., pp. 497-500. December, 1927.

Wallace, H. A., and George W. Snedecor. Correlation and machine calculation,
Iowa State College Bul. 35. 1925.

Wishart, John. Table of significant values of the multiple correlation coefficient,
Quart. Jour. Royal Meteorological Society, pp. 258-259. July, 1928.



INDEX 


Abscissa, defined, 10 
Adjustment for number of observations,
See Number of observations, ad-
justment for

Alienation coefficient, 494 
Allen, R. H., refs. on supply analysis,
441 

Allport, Gordon W., ref., 440 
Apple prices, as illustration of joint 
functions, 393 

Arithmetic average, standard error of, 
19-22 

See also Average, arithmetic 
Asparagus prices, illustration, 425 
Assumptions, in sampling, 15 
on free-hand curves, 109, 152, 224, 278 
Auto-stopping, as illustrative problem,
42 

Average, arithmetic, defined, 2
standard error of, 19
Average deviation, defined, 6
Averages, in determining functional re- 
lation, 47 

Bartels Technique, footnote, 355
Bean, Louis H., acknowledgment to, vii
ref. on cotton prices, 438
on graphic correlation, 439
on graphic vs. mathematical meth-
ods, 110

on potato and cotton prices, 439
on production response, 439
on short-cut method, 268, 300
on voting, 411
Been, Richard O., cit.ed, 319 
ref. on standard error, 358 
Benner, Claude L., ref. on egg prices,
439 

Bercaw, Louise O., ref. on price anal-
yses, 440


Beta coefficient, defined, 159 
in multiple correlation, 217 
Bias, in sampling, 28, 370 
Birge, Raymond T., ref., 24, 488 
Black, John D., influence of, x 
ref. on creamery costs, 439 
on graphic vs. mathematical meth-
ods, 110

on input-output, 437 
on short-cut method, 296, 300 
Bowley, Arthur L., ref., 444 
Brandt, A. E., ref. on computation 
method, 522 
Bruce, Donald, ref., 414
Bruce Adjustment, 403 
Bureau of Standards, 34
Burmeister, Gustave, ref. on potato 
yields, 437 

Cassels, J. M., ref. on supply analysis,
441 

Chamberlain, Edward, 441
Chatfield, Charlotte, ref. on beef com-
position, 438

Chauncey, Marlin R., ref., 440
Check sum, 461

Cherniack, Nathan, cited, 424
ref. on watermelon prices, 438
Children’s clothes, size standards for,
434 

Class interval, defined, 5
Coding, in fitting logarithms, 95
in fitting parabolas, 84
mathematical effect of, 400
Coefficient of alienation, defined, 139
Coefficient of correlation, See Correla-
tion coefficient

Coefficient of determination, defined, 
139 

Coefficient of multiple correlation, 210 





Coefficient of non-determination, de- 
fined, 140 

Coefficient of partial correlation, See 
Partial correlation coefficient 
Coefficient of regression, See Regression 
coefficient 

Coleman, D. A., ref., 438 
Computation methods, 455 
for standard deviation, 6, 12 
steps in, 449 

Contours, for joint regressions, 404 
Coordinates, defined, 10 
Corn yield, example of forecasting, 255 
illustrating sampling, 17 
multiple curvilinear regression, 225 
Correlation coefficient, adjustment for 
size of sample, 510 
compared to other measures, 160 
computation methods, 455 
defined, 137 

practical equation for, 149 
standard error of, 318 
Correlation coefficients, reliability of, 
504 

Correlation equations, proof of formu- 
las, 491 

Correlation index, defined, 138 
practical equation for, 156 
standard error of, 320 
Correlation measures, for short-cut 
linear method, 276 
mathematical meaning of, 493 
meaning of, 159 

Correlation methods, place in research, 
442 

Correlation ratio, defined, 306 
Correlation results, meaning of, 450 
Correlation table, 456 
Correlations, 136 

Cotton yield, illustrating standard er- 
ror of regression curve, 335 
study of fertilizing, 417 
used in illustrative problem, 147 
Court, Andrew T., illustration from, 
374 

ref. on joint functions, 414 
Court Method, for joint functions, 409 
Cowan, Donald R. G., ref., 440


Cowden, Dudley J., refs., 81, 93, 189,
445 

Cross classification, for many variables, 
186 

for three variables, 181 
Croxton, Frederick E., refs., 81, 93, 
189, 445 

Crum, W. L., ref., 444 
on joint costs, 440 
Cubic parabola, method of fitting, 89 
Curves, equations for, 76, 79 
logarithmic. See Logarithmic curves 
Curvilinear correlation, interpretation 
of results, 157 
practical methods for, 152 
standard error of forecast, 345 
Curvilinear functions, illustrations of, 
75 

Curvilinear multiple correlation, meas-
ure of, 264

standard error of estimate for, 259 
Curvilinear relation, fitted by free- 
hand curve, 105, 152, 222, 277 
by mathematical curves, 83, 89, 93, 
121 

Dairy cows, studies of feeding, 417 
Davis, Floyd E., ref., 413, 414 
Dean, Joel, ref. on cost curves, 441
Degrees of freedom, reduction in, 142 
Deming, W. Edwards, acknowledgment 
to, vii 
ref., 24 

on theory of errors, 488 
Dependent and independent variables,
functions for, 58 
straight lines for, 74 
Dependent variable, defined, 50 
mathematical representation of, 59 
result of selecting values of, 361 
Derksen, J. B. D., ref., 441 
Determination coefficient, 494 
Determinantal solution, for normal
equations, 478 

Deviation, average, See Average devia- 
tion 

standard, See Standard deviation 
Diedjens, V. A., ref., 439 





Differential regressions, 413 
Digiting, 459 
Dixon, H. B., ref., 438
Doolittle, M. H., 477 
Doolittle Method, 200, 464 
Dot chart, method of constructing, 36 
Dwyer, P. S., ref. on computation meth- 
ods, 478 

Education, use of correlation in, 429, 
434 

Egg price problem, illustrating non- 
quantitative variable, 302 
Elliott, Foster F., ref. on correlation 
method, 500 
on hog production, 439 
Equation, for straight line, 59 
linear, meaning of constants in, 60
parabolic, meaning of constants in,
118

selecting the type, 447
Equations, for different types of curves,
76, 79

limitations to, 102
list of, 512 

Error, probable, See Probable error
Error of estimate, See Standard error of
estimate

Errors of observation, in both variables,
306

in dependent variable, 365
in independent variable, 366
Estimate, See Forecast
Estimating dependent variable, limita-
tions on, 125

Extrapolation of regression equations,
347 

Ezekiel, Mordecai, data from, 278 
refs. on correlation methods, 110, 522
on dairy farming, 437
on error of regression coefficients,
327

on farmers’ earnings, 437
on hog prices, 438
on input-output, 437
on lamb prices, 438
on milk production, 411
on price analysis, 439


Ezekiel, Mordecai, refs. on short-cut
method, 300 
on steel costs, 441 
on tobacco farming, 438 

Factors, methods of measuring, 444 
Falling body, illustrating a mathemati- 
cal function, 114 

Farm income, illustrating multiple cor- 
relation, 169 

related to farm organization, 421 
Farm values, example of multiple rela- 
tions, 164 

Fellows, H. C., ref., 438
Fisher, R. A., cited, 10

on standard error of r, 319 
ref., 93 

on analysis of variance, 189 
on error of multiple correlation, 
323 

on sampling, 522 
on standard errors of b, 470
on use of differential regression, 419
on wheat yield, 438
statistical contributions of, x
Foote, Richard J., ref. on short-cut 
method, 300 

Forecast, change in universe, 31

from multiple correlation, standard
error of, 344

from simple correlation, standard
error of, 342

practical procedures for, 350
Forecasting corn yields, illustrating
method, 255

Freehand curve, as method of express-
ing relation, 105, 152, 222, 277
assumptions for, 109, 152, 221, 278
Freeman, Frank S., ref. on intelligence
tests, 440

Frequency distribution, chart of, 11
Frequency table, defined, 3
Frisch, Ragnar, ref., 448
Function, defined, 39
linear, assumptions of, 73
Functional relation, graphic illustration
of, 38

mathematical expression of, 39





Functional relation, statistical measure-
ment of, 42

use of averages to determine, 47 
Funk, W. C., ref., 437 

Gabriel, Harry G., ref. on egg prices, 
439 

Gallup poll, 435 

Gans, A. E., ref. on milk elasticity, 439
Garrison, K. C., ref. on intelligence 
tests, 440 

Girschick, Meyer A., acknowledgment 
to, vii 

cited, 301, 477 
ref. on clothing sizes, 441 
Glossary, 521 

Goodenough, F. L., ref. on intelligence
tests, 440 

Goodrich, C. L., ref., 437
Gosnell, Harold, ref. on politics, 441 
Gowen, John W., refs. on dairying, 439
Graphic chart, for adjusting observed 
correlations, 510 
for small samples, 505 
Graphic charts, for reliability of ob- 
served correlation, 506-9 
Graphic correlation, standard error of,
327 

Graphic curve, fitting freehand, 105 
limited by logical conditions, 109 
Graphic curves, See Regression curves
and Multiple regression curves 
Graphic interpolation, 87 
Graphic method, 222, 268 
auxiliary graphic processes, 479 
Graphic regression curves, adjustments 
by least-squares, 401 
Gravity, illustration of mathematical 
function, 114 

Greek letters, meaning of, 21 
Guthrie, Edward S., ref. on creamery 
costs, 439 

Haas, G. C., cited, 415 
ref. on hog prices, 438 
on land prices, 437 
Hafstad, L. R., ref., 355 
Hainsworth, R. G., acknowledgments 
to, vii, xi 


Hanau, Arthur, cited, 423 
ref. on hog prices, 438 
Hardenburg, E. V., ref. on potato 
yields, 437 

Harper, F. A., ref., 186 
Hay-stack problem, illustrating joint 
correlation, 376 
Hedden, W. P., cited, 424 
ref. on watermelon prices, 438 
Higbie, Edgar Creighton, ref., 440 
Hog prices, illustrating a mathemati- 
cally fitted equation, 121 
Holdaway, C. W., ref. on dairy farm- 
ing, 437 

Hole, Erling, ref. on dairying, 441
Horse feed, illustrative problem, 129 
Hosterman, W. H., data from, 378 
Hotelling, Harold, ref., 316 
Howe, Charles B., data from, 302 
ref. on egg prices, 439 
Hull, Clark L., ref. on prediction 
formulas, 440 

Huxley, Julian, quotation from, 1 
Hyperbolas, characteristics of, 80 
Hypothesis, checking against relations, 
452 

method of developing, 443 

Independent variable, defined, 50 
mathematical representation of, 59 
result of selecting values, 360 
Index of correlation, See Correlation 
index 

Index of determination, defined, 140 
Index of multiple correlation, defined, 
264 

for joint correlation, 387 
from short-cut method, 293 
Individual forecast, standard error of, 
341, 469 

Industrial commodities, cost curves for, 
434 

prices of, 434 

Inference, statistical, See Statistical in- 
ference 

Input-output relations, studies of, 416
Intelligence, components of, 434 
Interpolation, graphic, 87 





Interpretation of correlation results, 71, 
151, 157, 254, 450 
Interquartile range, 4 
Irrigation water, illustrative problem, 
147 

Ives, J. Russell, ref. on short-cut 
method, 300 

Jensen, Einar, ref. on milk production, 
441 

Jerome, Harry, ref., 444 
Johnson, Sherman E., ref. on dairy 
farming, 437 

Joint correlation, for n variables, 391 
Joint function, for three variables, 376 
mathematical functions, 407 
Joint regression, 372 
Joint relations, identifying by short- 
cut method, 296 

Kantor, Harry, ref. on peach prices, 
438 

Kelley, Truman L., cited, 150 
refs., 282, 494 

Kendall, M. G., refs., 189, 522
Kifer, R. S., ref. on dairy farming, 437
Killough, Hugh B., ref., 103
on oat prices, 438
Koon, R. M., ref., 439
Koopmans, T., ref., 441
Kuhrt, W. J., ref. on wheat prices, 439
Kyle, Corinne F., ref. on wheat protein,
438 

Land values, 415 

Least squares, derivation of normal 
equations, 489 
in fitting straight line, 64
note on, 67 

Lettuce price, as illustration of hypoth- 
esis, 443 

Line of best fit, defined, 67
Linear correlation, interpretation of re-
sults, 150

practical methods for, 147, 455 
Linear equation, interpretation of, 71,
151 

meaning of constants in, 60 


Linear net regressions, by short-cut 
method, 269 

Linear regression equation, 145 
for three variables, 191, 198 
Literary Digest poll, 435 
Logarithmic curve, method of fitting, 93 
Logarithmic curves, characteristics of, 79 
Logarithmic equation, fitted to hog 
prices, 121 

for minimizing absolute departures, 
99 

Logical conditions, on free-hand curve, 
109 

on multiple regression curves, 224, 278 
on regression curves, example of, 152 
Logical significance, of mathematical 
functions, 113 

McNall, P. E., ref., 327 
on dairy farming, 437 
Malenbaum, Wilfred, ref. on graphic vs.
mathematical methods, 110 
on short-cut method, 296, 300 
on supply analysis, 441 
Market prices, studies of, 422 
Mathematical curves, limitations to, 102 
Mathematical equations, for joint func- 
tions, 407 

in an economic problem, 121 
when to fit, 120 

Mathematical functions, for net regres- 
sion curves, 396 
logical significance, 113 
Mathematical net regression curves, 
method of fitting, 221, 396 
standard error of, 339 
Mean, arithmetic, defined, 2 
standard error of, 19-22 
Median, defined, 4 

Mendenhall, Robert M., ref. on com- 
putation methods, 522 
Mensenkamp, L. E., ref., 439
Mighell, R. L., refs. on supply analysis,
441 

Mills, Frederick C., ref., 92 
on correlation method, 522 
Minor, W. A., Jr., ref., 440 
Mises, Richard von, ref., 350 




Misner, E. G., data from, 225
ref. on dairy rations, 437 
on weather and prices, 437 
Moore, Henry L., cited, 422 
ref. on cotton forecasting, 438 
on cycles, 437, 438 
Morrison, F. B., ref., 327 
on dairying, 437 
Muir, James C., data from, 147 
Multiple correlation, affected by errors 
of observation, 367 
by successive elimination method, 169 
computation methods, 459 
theoretical illustration of, 164 
Multiple correlation coefficient, 219
reliability of, 504 
standard error of, 321 
Multiple curvilinear correlation, index 
of, 264 

standard error of, 327 
Multiple curvilinear regressions, by 
mathematical equations, 221, 396 
by short-cut method, 277 
by successive approximations, 222 
Multiple determination, coefficient of, 
211 

Multiple regression curves, graphic, by 
successive approximations, 222 
limitations on use, 254 
mathematical, 221, 396 
stating the conclusions, 247 
Multiple regression equation, defined, 
165, 196 

interpretation of, 205 
Multiple relations, examples of, 163 

Net regression coefficient, defined, 165, 
196-7 

standard error of, 321 
Net regression coefficients, computation 
method, 459 
meaning of, 202, 205 
standard error for, 321, 469 
Net regression curves, by mathematical 
functions, 221, 396 
standard error of, 327, 339 
Non-determination coefficient, 494
Non-quantitative variable, 302 


Normal distribution, description of, 9 
Normal equations, alternative methods 
of solving, 477 
derivation of, 496 
method of solving, 464 
Number of observations, adjustments 
for, 133, 141, 209, 211 

Objective, method of stating, 442 
O’Brien, Ruth, ref. on clothing sizes, 
441 

Observation equations, defined, 61 
Observed range, extrapolation beyond, 
347 

Ordinate, defined, 10 
Orthogonal polynomials, in fitting 
trends, 93 

Pallesen, J. E., refs., 413, 414
Parabola, characteristics of, 78, 80 
cubic, See Cubic parabola 
equation for, 76 
method of fitting, 83 
Parabolas for time series, method of 
fitting, 92 

Parabolic equation, meaning of con-
stants, 118

Part correlation, coefficient of, 213 
(footnote) 
derivation of, 497 

Partial correlation, coefficient of, 213 
method of computing, 474 
Partial correlation indexes, 267 
Partial regression coefficient, 197 
Patton, Alson C., ref., 444 
Patton, Palmer, ref. on crop yields, 437 
Pearson, F. A., ref. on prices, 439 
Perpendicular regressions, 367 
Pettit, Edison, ref., 438 
Physical characteristics, studies of, 420 
Political behavior, studies of, 435
Pond, George, ref. on dairy farming, 437 
Potato yield problem, illustrating joint 
correlation, 407 

Practical correlation methods, 146, 442, 
455 

Practical procedures, for correlation
computations, 146, 455 
for reliability of forecasts, 356 





Price analysis, examples, 422 
simplified illustration, 124 
Prices of farm products, 422 
of industrial commodities, 434
influence on production, 428 
Probable error, 26 
Product moment, defined, 149 
Production, response to price, 428, 430 
Projectile, trajectory of, 118 
Protein content, wheat illustration, 82 
Psychology, use of correlation in, 429,
434

Purves, Clarence M., acknowledgment
to, vii 

Quadratic equation, illustration of, 114 
Quartiles, lower and upper, defined, 4 

Raeburn, John R., ref., 393 
Ratio, correlation, See Correlation ratio
Regression coefficient, compared to
other measures, 161
computation methods, 455
defined, 146
gross, 165
net, 165, 196 

practical equation for, 148, 149 
standard error of, 312 
Regression curve, defined, 146
Regression curves, limitations on, See
Logical limitations on regression
curves

Regression equation, defined, 146
extrapolation of, 347
for n variables, 203
See also Linear and multiple regres-
sion equation

Regression line, defined, 146
standard error of, 315
Reliability, See Standard error
Richey, Frederick D., cited, 429
ref. on corn breeding, 439
Roos, C. F., ref. on automobile demand,
441

Roper poll, 435

Ross, H. A., ref. on milk marketing, 439


Sample, defined, 14 
reliability of, 18 
selection of, 359 
size required, 29 
spot and stratified, 16 
Samples, small. See Small samples 
Sampling, assumptions of, 15 
bias in, 28 
of time series, 349 

Sampling theories, assumptions in, 486 
Sarle, Charles F., acknowledgment to, 
vii 

Sasuly, Max, ref., 92 
Schoenfeld, William A., ref. on milk
marketing, 439 
Schultz, Henry, ref., 346 
on prices, 440 

on standard errors, 340, 358, 522 
Secrist, Horace, ref., 444 
Separate determination, 498 
Shepherd, Geoffrey, ref. on short-cut
method, 300 

Sheppard's Correction, 12, 458 
Shollenberger, J. H., ref. on wheat ker-
nels, 438

Short-cut method of multiple correla- 
tion, 268 

auxiliary graphic processes for, 479 
refs. on, 300

Short-cut method for multiple curvi-
linear regressions, 277
Small samples, correction for, 23
reliability of, 22, 504
Smith, Bradford B., cited, 423, 497
refs. on correlation methods, 522
on cotton acreage, 439
on cotton and interest rates, 440
on cotton prices, 438
on cotton yield, 437
on production, 438
Smith, G. E. P., data from, 147
Snedecor, George W., ref., 189
on computation methods, 522
Spearman, C., cited, 435
refs. on factor theory, 441
on “footrule correlation,” 440
Spillman, William J., cited, 152
Spot sample, 16





Standard deviation, defined, 8 
method of computing, 8 
short-cut method of computing, 12 
Standard error, 25 

of arithmetic average, 19-22 
of correlation coefficient, 318 
of correlation index, 320 
of group averages, 49, 54 
of individual estimate, method of 
computing, 469 

of mathematical regression curves, 339 
of mean, derivation of, 486 
of multiple correlation coefficient, 321 
of net regression coefficient, 321 
of partial regression coefficients, com- 
putation method, 469 
of regression coefficient, 312 
of regression line, 315 
of standard error, 24-25 
for time series, 349 
Standard error of estimate, 128 
compared to other measures, 160 
for curvilinear multiple correlation, 
259 

defined, 131 

for joint correlation, 387 

for linear multiple correlation, 208 

practical equation for, 149 

for short-cut method, 293 

for simple curvilinear correlation, 131 

for simple linear correlation, 129 

units for, 135 

Standard error of forecast, 341 
from curvilinear correlation, 345 
from multiple correlation, 344 
from simple correlation, 342 
Statistical inference, assumptions of, 15, 
486 

Steel cost problem, as illustration for 
short-cut method, 278 
as illustration of standard error of net 
regression curves, 338 
Stevens, Chester D., ref., 437 
Stine, O. C., acknowledgment to, vii
Straight line, equation for, 59 
fitted by least squares, 64 
graph of, 60, 61 

interpretation of equation for, 71 


Straight line, methods of fitting, 59 
Stratified sample, defined, 16 
“Student,” ref. on correlation, 504 
statistical contributions of, x 
Sturges, Alexander, ref., 350
Successive approximation method, for 
net regression curves, 222 
Supplementary methods, for net regres- 
sion curves, 401 

Sutherland, H. E. G., ref. on I. Q., 440 

t-test, for correlation coefficient, 318
for multiple correlation coefficient, 323 
Table, frequency. See Frequency table 
Taylor, Henry C., influence of, x 
Taylor, C. C., ref. on farm management, 
438 

Thomsen, F. L., ref. on prices, 440 
Thurstone, L. L., ref., 441 
Time series, error formula for, 349 
method of fitting parabolas to, 92 
Time-series analysis, examples of chang- 
ing, 426 

Tolley, Howard R., acknowledgments 
to, vii, x
cited, 500 

refs. on correlation methods, 522
on input-output, 437 
Traynor, Kenneth, ref., 437 
Trends, fitting to time series, 92 
Tretsven, J. O., ref. on dairy farming,
437 

Universe, changing, 426 
defined, 14 

Universes, past and present, 30 

Variable, dependent. See Dependent 
variable 

independent. See Independent 
variable 

Variables, relations between, 34 
units of statement, 446 
Variance, analysis of, 189 
defined, 10 

related to correlation coefficient, 403 
Vernon, J. J., ref. on dairy farming, 437 
on tobacco farming, 438 





von Szeliski, Victor, ref. on automo-
bile demand, 441

Waite, Warren C., refs. on short-cut
method, 300 

Wald, Abraham, ref., 367 

Walker, Helen M., ref., 522 

Wallace, Henry A., ref. on computa-
tion methods, 522

Warren, George F., ref. on prices, 439 

Warren, Richard, ref. on computation 
methods, 522 

Waugh, Frederick V., acknowledg-
ments, vii, xi
cited, 347 

data from, 393, 407 
ref. on computation methods, 478 
on potato prices, 438 
on potato yields, 437 
on standard error, 358 
on vegetable prices, 430 

Waugh Method, for joint regressions, 
404 

Weather conditions, influence on crop 
yields, 418 

Weeks, Angelina L., ref., 410 

Wellman, H. R., ref. on short-cut 
method, 300 

Wells, O. V., ref. on dairy farming, 437

Westbrook, E. C., ref., 437 


Wheat, vitreous kernels, as illustrative 
problem, 82 

Wheat protein, illustrating free-hand 
curve fitting, 106 
ref. on, 438

Wheat yields, study of, 419 
Whitcomb, W. D., ref., 439 
Winch, W. H., ref., 440 
Wishart, John, ref. on standard errors,
522 

Witmer, Helen Leland, ref. on sex edu-
cation, 440

Wolfe, T. K., ref. on corn breeding, 439
Working, Elmer J., acknowledgment 
to, vii 

ref. on short-cut method, 300 
Working, Holbrook, ref., 316 
on potato differentials, 439 
on potato prices, 438 
Wright, Sewall, ref. on correlation 
method, 500 

Wylie, Kathryn H., data from, 278 
ref., 327 

on steel costs, 441 


Yield, corn, See Corn yield
Yule, G. Udny, cited, 16 
ref., 189, 440, 522 
on error theory, 489 



